<P> <img src="https://i.ibb.co/gyNf19D/nhslogo.png" alt="nhslogo" border="0" width="100" align="right"><font size="6"><b> CS4132 Data Analytics</b> </font>
# "Indie Games: The Underdogs of Gaming" by Ayden Ang

# Table of Contents (with relevant hyperlinks to sections)

- [Motivation & Background](#Motivation-and-Background)
- [Summary of Research Questions & Results](#Summary-of-Research-Questions-&-Results)
- [Dataset](#Dataset)
- [Methodology](#Methodology)
    - [Data Acquisition](#Data-Acquisition)
    - [Data Cleaning](#Data-Cleaning)
        - [steam_df](#steam_df)
        - [avg_players_df & peak_players_df](#avg_players_df-&-peak_players_df)
        - [itchio_df](#itchio_df)
    - [EDA](#EDA)
        - [Q1: How popular are indie games compared to AAA games?](#Q1_EDA)
        - [Q2: What are the major differences between indie games and AAA games?](#Q2_EDA)
        - [Q3: What factors contribute to the success of an indie game?](#Q3_EDA)
        - [Q4: What are factors that indie game developers have to consider when developing an indie game?](#Q4_EDA)
- [Results Findings & Conclusion](#Results-Findings-&-Conclusion)
    - [Q1: How popular are indie games compared to AAA games?](#Q1)
    - [Q2: What are the major differences between indie games and AAA games?](#Q2)
    - [Q3: What factors contribute to the success of an indie game?](#Q3)
    - [Q4: What are factors that indie game developers have to consider when developing an indie game?](#Q4)
    - [Conclusion](#Conclusion)
- [Recommendations or Further Works](#Recommendations-or-Further-Works)
- [References](#References)

The video game industry has been growing over the past several years, with new games being developed and released all the time. Many of these games are developed by large studios that employ hundreds of people and have substantial funding, which allows them to dedicate significant financial resources and manpower to development. These are what people call Triple-A or AAA games: games distributed by large, well-known companies such as Sony and Microsoft.

However, a negative stigma has started to surround AAA games. For a variety of reasons, such as microtransactions, buggy and incomplete releases, and a lack of risk-taking, many recently released AAA games have been widely regarded as poorer in quality and lackluster to play. This is largely attributed to AAA studios focusing more on profit than on creating a truly enjoyable game, putting quantity over quality.

With AAA games on the decline, there is a gap in the video game industry to be filled. This is where indie games come in. Indie games are developed by individuals or small groups without the technical and financial resources of a AAA studio. As a result, the scope of an indie game is usually much more limited than that of a AAA game, and it can be hard for indie developers to build a game without the support of a large team. Despite this, many indie games are extremely high in quality, comparable to AAA games and perhaps even better. Indie developers also have the freedom to pursue whatever game idea they want, something AAA studios do not have, which results in a greater variety of indie games.

Indie games are quickly rising in popularity, not only among consumers but also among game developers. Gamers are starting to see the appeal of indie games, which increases the demand for them. Many of the games sold on Steam, one of the most popular video game distribution services, are indie games, which suggests that many gamers want to buy and play them. Indie games are also encouraging more individuals to begin their journey in game development, as more tools and services for developing indie games have recently become available. One of these services is itch.io, a website where anybody can host and sell their own indie games independently. While it is not as popular as Steam as a distribution service, it is focused entirely on indie games, allowing indie developers to publish and sell their own passion projects online.

Therefore, I would like to analyse the rise in popularity of indie games among both consumers and game developers, as well as how they compare to AAA games.
1. <b>How popular are indie games compared to AAA games?</b>

   While it is known that indie games are on the rise, it is still unclear whether they have grown to the point that they can overtake AAA games. It is also unclear when the upsurge of indie gaming truly began. Therefore, I want to compare the popularity of indie games to that of AAA games over different time frames, to see the trends and visualise how indie games are catching up to AAA games.

2. <b>What are the major differences between indie games and AAA games?</b>

   Indie games and AAA games differ greatly, both in development and in gameplay. While AAA games often have a large budget and a lot of manpower allocated to them, indie games do not have that luxury. As a result, while AAA games can have a larger scope and more content, indie games have to find other ways to appeal to consumers. Therefore, I want to analyse the major differences in the trends of indie games and AAA games.

3. <b>What factors contribute to the success of an indie game?</b>

   While indie gaming as a whole has been on the rise, not all indie games are equally popular. Some have become as famous as AAA games, while others have only a small playerbase. Therefore, I want to analyse which factors affect the popularity of indie games and what allows some of them to become successful. There may be differing trends among indie games when they are grouped by popularity.

4. <b>What are factors that indie game developers have to consider when developing an indie game?</b>

   Because indie game developers do not have much manpower or many financial resources, they often have limited options and cannot develop games with very large scopes. This makes indie game development very tough, yet many individuals and small groups still manage to develop a finished product. Therefore, I want to find out how indie game developers make development more manageable. Perhaps some tools and software are more popular among developers because they make indie game development easier, or some types of games are more popular to develop than others.

1. https://steamdb.info/stats/gameratings/?all

   This website contains the largest list of Steam games that I could scrape. The HTML of this page lists a total of 58410 Steam games. I used it to get the ID, name and distribution of positive and negative reviews of each game.

2. https://steamspy.com/api.php

   This is the API link of a website that lists various information about Steam games. I used this API to obtain the estimated range of the number of owners of each game, the price of each game, and the average and median playtime of its players.

3. https://store.steampowered.com/

   This is the official site of Steam and contains the store page of every Steam game. I used the store page of each game to get its developers, publishers, release date, languages, genres and tags.

4. https://steamcharts.com/

   This website contains statistics on the concurrent players of each Steam game. I used this site to obtain the average and peak concurrent players per month for each game.

5. https://itch.io/games/top-rated

   This is the official site of itch.io and lists the top-rated itch.io games. I used this site to obtain various information from the store pages of 13968 itch.io games.

All relevant imports are listed here.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from datetime import *
import re
```

<b>all_steam_games_info.csv</b> contains all of the info collected for the 58410 Steam games.

- <b>id:</b> The ID of the game.
- <b>name:</b> The name of the game.
- <b>total_reviews:</b> The total number of reviews of the game.
- <b>positive_reviews:</b> The number of positive reviews of the game.
- <b>negative_reviews:</b> The number of negative reviews of the game.
- <b>rating:</b> The percentage of reviews for the game that are positive.
- <b>owners:</b> The estimated number of owners of the game.
- <b>min_owners:</b> The minimum estimated number of owners of the game.
- <b>max_owners:</b> The maximum estimated number of owners of the game.
- <b>avg_playtime:</b> The average playtime of players of the game in hours.
- <b>median_playtime:</b> The median playtime of players of the game in hours.
- <b>price:</b> The price of the game in dollars.
- <b>date:</b> The release date of the game.
- <b>developers:</b> The developers of the game. Stored in a list.
- <b>publishers:</b> The publishers of the game. Stored in a list.
- <b>languages:</b> The available languages that the game is in. Stored in a list.
- <b>genres:</b> The genres of the game. Stored in a list.
- <b>tags:</b> The tags of the game. Stored in a list.
```python
steam_df = pd.read_csv("all_steam_games_info.csv")
steam_df
```

<b>all_steam_games_avg_players_per_month.csv</b> contains the average concurrent players per month for each Steam game. The ID of the game is in the leftmost column, while the remaining columns hold the average concurrent players for each month from July 2012 to July 2022.
```python
avg_players_df = pd.read_csv("all_steam_games_avg_players_per_month.csv")
avg_players_df
```

<b>all_steam_games_peak_players_per_month.csv</b> contains the peak concurrent players per month for each Steam game. The ID of the game is in the leftmost column, while the remaining columns hold the peak concurrent players for each month from July 2012 to July 2022.
```python
peak_players_df = pd.read_csv("all_steam_games_peak_players_per_month.csv")
peak_players_df
```

<b>itch.io_games_info.csv</b> contains all of the info collected for the 13968 itch.io games.

- <b>Name:</b> The name of the game.
- <b>Price:</b> The price of the game.
- <b>Platforms:</b> The platforms that the game supports. Stored in a list.
- <b>Rating:</b> The rating of the game.
- <b>Authors:</b> The developers of the game. Stored in a list.
- <b>Genre:</b> The genres of the game. Stored in a list.
- <b>Tags:</b> The tags of the game. Stored in a list.
- <b>Number of Reviews:</b> The number of reviews of the game.
- <b>Made with:</b> The tools and software used in the development of the game. Stored in a list.
- <b>Average session:</b> The average playtime of players of the game.
- <b>Languages:</b> The available languages that the game is in. Stored in a list.
- <b>Inputs:</b> The inputs that the game supports. Stored in a list.
- <b>Accessibility:</b> The accessibility options the game provides. Stored in a list.
```python
itchio_df = pd.read_csv("itch.io_games_info.csv")
itchio_df
```

The first dataset to be cleaned is <b>steam_df</b>.

Because the <b>developers</b>, <b>publishers</b>, <b>languages</b>, <b>genres</b> and <b>tags</b> columns were collected as lists and saved as text, some of the values in these columns are simply the string for an empty list ("[]").
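To make the idea concrete, here is a minimal sketch on a toy frame. The game and studio names are made up, and I am assuming the list columns are stored as plain strings, so an empty list shows up as the literal text "[]".

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) where list columns are stored as strings.
toy_df = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C"],
    "developers": ["['Studio X']", "[]", "['Studio Y', 'Studio Z']"],
    "tags": ["['Indie', 'Puzzle']", "['Action']", "[]"],
})

list_columns = ["developers", "tags"]

# Flag rows where at least one list column is the literal string "[]".
has_empty_list = toy_df[list_columns].eq("[]").any(axis=1)

# Replace the empty-list strings with NaN, restricted to the list columns.
cleaned_df = toy_df.copy()
cleaned_df[list_columns] = cleaned_df[list_columns].replace("[]", np.nan)
```

Restricting the replacement to the list columns is a design choice in this sketch: it avoids touching any other column that might legitimately contain the text "[]".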
```python
# Rows where at least one list-valued column is the string "[]".
empty_lists = (
    (steam_df.developers == "[]")
    | (steam_df.publishers == "[]")
    | (steam_df.languages == "[]")
    | (steam_df.genres == "[]")
    | (steam_df.tags == "[]")
)
steam_df.loc[empty_lists]
```

These values can be replaced with NaN, as they contain no information.
```python
steam_df = steam_df.replace("[]", np.nan)
steam_df.loc[empty_lists]
```

Next, some games have NaN listed as their <b>price</b>. This is not because these games are free-to-play, as other free-to-play games have 0.00 listed as their price. Rather, it is most likely due to a bug in SteamSpy's API, which was used to collect the price data.
```python
steam_df.loc[steam_df.price.isna()]
```

The SteamSpy API was also used to collect the owner numbers, as well as <b>avg_playtime</b> and <b>median_playtime</b>. However, <b>avg_playtime</b> and <b>median_playtime</b> seem unaffected by this bug, since the games with NaN as their <b>price</b> still have non-zero values for both. On the other hand, <b>owners</b>, <b>min_owners</b> and <b>max_owners</b> appear to be stuck at a single value for all of these games.
```python
steam_df.loc[steam_df.price.isna()].owners.value_counts()
```

As a result, we have no choice but to set the <b>owners</b>, <b>min_owners</b> and <b>max_owners</b> of the affected games to NaN, as these values could be inaccurate and could distort the distribution of the data.
```python
steam_df.loc[steam_df.price.isna(), ["owners", "min_owners", "max_owners"]] = np.nan
steam_df.loc[steam_df.price.isna()]
```

The <b>date</b> column was collected as strings, so it has to be converted to datetime format to allow for time series analysis.
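As a small illustration of this kind of conversion, the sketch below parses a few made-up date strings. The ISO-style format and the "TBD" placeholder are assumptions for the example; `errors="coerce"` is one way to turn unparseable entries into missing values without replacing them by hand.

```python
import pandas as pd

# Hypothetical release-date strings, including a placeholder for an unreleased game.
raw_dates = pd.Series(["2013-07-09", "TBD", "2020-11-10"])

# errors="coerce" turns anything unparseable (such as "TBD") into NaT.
parsed_dates = pd.to_datetime(raw_dates, errors="coerce")

# Once parsed, date components are available through the .dt accessor.
release_years = parsed_dates.dt.year
```

The trade-off is that coercion hides every malformed value, not just "TBD", so it is worth checking how many entries became NaT afterwards.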
```python
# "TBD" marks games without a release date yet, so treat it as missing.
steam_df.loc[steam_df.date == "TBD", "date"] = np.nan
steam_df.date = pd.to_datetime(steam_df.date)
steam_df
```

In order to compare indie and non-indie games, each game has to be classified as either "indie" or "non-indie". To do this, we can look at the <b>genres</b> and <b>tags</b> of a game: if either of them includes "Indie", the game is classified as an indie game, otherwise it is a non-indie game. We can create a new column <b>is_indie</b> to store this classification.
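The rule can be sketched on toy data as follows. The rows are hypothetical, and passing `na=False` is my choice here for treating a missing genre or tag list as "not indie".

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical) with stringified genre and tag lists, some missing.
toy_df = pd.DataFrame({
    "name": ["Game A", "Game B", "Game C", "Game D"],
    "genres": ["['Indie', 'Action']", "['RPG']", np.nan, "['Strategy']"],
    "tags": ["['Pixel Graphics']", "['Open World']", "['Indie']", np.nan],
})

# na=False makes missing values count as "does not contain Indie".
in_genres = toy_df["genres"].str.contains("Indie", regex=False, na=False)
in_tags = toy_df["tags"].str.contains("Indie", regex=False, na=False)
toy_df["is_indie"] = in_genres | in_tags
```

Note that this is a substring match on the stored text, so any genre or tag whose name merely contains "Indie" would also count.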
```python
# A game is indie if "Indie" appears in its genres or its tags.
# na=False treats missing genres/tags as not containing "Indie".
is_indie = (
    steam_df.genres.str.contains("Indie", na=False)
    | steam_df.tags.str.contains("Indie", na=False)
)
steam_df.insert(2, "is_indie", is_indie)
steam_df
```

Unfortunately, Steam does not publish the true number of owners of each game. SteamSpy can therefore only estimate a range for the number of owners, which is why there are 3 owner columns: <b>owners</b>, <b>min_owners</b> and <b>max_owners</b>. This makes <b>owners</b> an imprecise measure of a game's popularity and success, so <b>total_reviews</b> is used for that purpose instead. The owner columns are still useful for sorting the games into broad popularity categories, which allows us to compare Steam games of different popularity.
```python
def owner_binning(row):
    # Bin each game by the lower bound of its estimated owner range.
    if row.min_owners >= 1000000:
        row["owners_binned"] = "Highest"
    elif row.min_owners >= 100000:
        row["owners_binned"] = "High"
    elif row.min_owners >= 20000:
        row["owners_binned"] = "Low"
    else:
        row["owners_binned"] = "Lowest"
    return row

steam_df = steam_df.apply(owner_binning, axis="columns")
# Games with a missing price have unreliable owner estimates, so leave them unbinned.
steam_df.loc[steam_df.price.isna(), "owners_binned"] = np.nan
steam_df
```

The games have been binned into 4 categories: Highest, High, Low and Lowest. The bins are unequal in size, with the number of games shrinking roughly exponentially from Lowest to Highest.
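For reference, the same thresholds can also be expressed with `pd.cut`, which bins a whole column at once instead of going row by row. This is only a sketch of an alternative on made-up values, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical lower bounds of the estimated owner ranges.
min_owners = pd.Series([0, 20_000, 50_000, 100_000, 500_000, 1_000_000, np.nan])

# Left-closed intervals:
# [0, 20k) Lowest, [20k, 100k) Low, [100k, 1M) High, [1M, inf) Highest.
owners_binned = pd.cut(
    min_owners,
    bins=[0, 20_000, 100_000, 1_000_000, np.inf],
    labels=["Lowest", "Low", "High", "Highest"],
    right=False,
)
```

With `right=False` each threshold belongs to the higher bin, matching the `>=` comparisons, and a missing `min_owners` simply stays missing.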
```python
steam_df.owners_binned.value_counts()
```

```python
plt.figure(figsize=(7, 7))
plt.pie(
    steam_df.owners_binned.value_counts(),
    labels=steam_df.owners_binned.value_counts().index,
    autopct='%1.1f%%',
)
plt.show()
```

Finally, we can add columns for the number of developers, publishers and languages a game has, as columns <b>developers_count</b>, <b>publishers_count</b> and <b>languages_count</b> respectively.
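Counting the entries of a stringified list can be sketched like this. The language lists are made up, and I am assuming items are separated by ", " inside square brackets.

```python
import numpy as np
import pandas as pd

# Hypothetical stringified language lists, one of them missing.
languages = pd.Series(["['English']", "['English', 'French', 'German']", np.nan])

# Strip the surrounding brackets, split on ", " and count the pieces.
# Missing values stay missing, which the nullable Int64 dtype can represent.
languages_count = languages.str[1:-1].str.split(", ").str.len().astype("Int64")
```

One caveat of splitting on ", " is that a single entry containing a comma followed by a space would be counted as two items.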
```python
# Count list items by stripping the brackets and splitting on ", ".
# Columns are inserted from right to left so the earlier positions stay valid.
steam_df.insert(17, "languages_count", steam_df.languages.str[1:-1].str.split(", ").str.len().astype("Int64"))
steam_df.insert(16, "publishers_count", steam_df.publishers.str[1:-1].str.split(", ").str.len().astype("Int64"))
steam_df.insert(15, "developers_count", steam_df.developers.str[1:-1].str.split(", ").str.len().astype("Int64"))
steam_df
```

Finally, we have the summary of <b>steam_df</b>. None of the columns have many NaN values, with the most being 2123, in the <b>price</b> column and the owner columns. Therefore, NaN values can be dropped if required.
```python
steam_df.info()
```

For <b>avg_players_df</b>, there are a lot of NaN values. This is due either to missing data or to the game not having been released yet in that time period. In both cases, we can replace the NaN values with 0.

The month column labels have to be converted to datetime format to allow for time series analysis.

The <b>is_indie</b> and <b>owners_binned</b> columns from <b>steam_df</b> can be added so that we can compare indie and non-indie games, as well as Steam games of different popularity.
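Copying columns from one frame to another by row position only works if both CSV files list the games in the same order. If that cannot be guaranteed, the labels could be attached by game ID instead. The sketch below shows this on toy data; the IDs, month labels and player counts are all made up.

```python
import pandas as pd

# Toy stand-in for steam_df (hypothetical values).
steam_toy = pd.DataFrame({
    "id": [10, 20, 30],
    "is_indie": [True, False, True],
    "owners_binned": ["Low", "Highest", "Lowest"],
})

# Same games, deliberately listed in a different row order.
players_toy = pd.DataFrame({
    "id": [30, 10, 20],
    "2022-06": [5.0, 120.0, 90000.0],
    "2022-07": [4.0, 110.0, 95000.0],
})

# Left join on id keeps every row of players_toy in its original order.
# validate raises an error if an id appears more than once in steam_toy.
labelled = players_toy.merge(
    steam_toy[["id", "is_indie", "owners_binned"]],
    on="id",
    how="left",
    validate="many_to_one",
)
```

A game that is missing from the lookup frame would end up with NaN labels rather than silently receiving another game's labels.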
```python
avg_players_df.fillna(0, inplace=True)
# Keep "id" as-is and parse the remaining column labels as dates.
avg_players_df.columns = ["id"] + list(pd.to_datetime(avg_players_df.columns[1:]))
# Series are aligned on the row index, so both frames must list the games in the same order.
avg_players_df.insert(1, "is_indie", steam_df.is_indie)
avg_players_df.insert(2, "owners_binned", steam_df.owners_binned)
avg_players_df
```

A similar process can be done for <b>peak_players_df</b>.
```python
peak_players_df.fillna(0, inplace=True)
peak_players_df.columns = ["id"] + list(pd.to_datetime(peak_players_df.columns[1:]))
peak_players_df.insert(1, "is_indie", steam_df.is_indie)
peak_players_df.insert(2, "owners_binned", steam_df.owners_binned)
peak_players_df
```

Less cleaning has to be done on <b>itchio_df</b> than on <b>steam_df</b>.

The <b>Price</b> column was collected as strings, so it has to be converted to floats.
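A minimal sketch of such a conversion is shown below. The price strings are hypothetical, and I am assuming each one starts with a single currency symbol.

```python
import pandas as pd

# Hypothetical price strings, including a missing value and a non-numeric entry.
raw_prices = pd.Series(["$4.99", "$12.00", None, "Free"])

# Drop the leading currency symbol, then coerce anything non-numeric to NaN.
prices = pd.to_numeric(raw_prices.str[1:], errors="coerce")
```

Anything that is not a number after the first character is removed, such as the "Free" entry here, becomes NaN instead of raising an error.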
```python
# Strip the leading currency symbol; values that are not numbers become NaN.
itchio_df.Price = pd.to_numeric(itchio_df.Price.str[1:], errors='coerce')
itchio_df
```

Similar to <b>steam_df</b>, we can add columns for the number of platforms, tools, languages, inputs and accessibility options a game has, as columns <b>platforms_count</b>, <b>tools_count</b>, <b>languages_count</b>, <b>inputs_count</b> and <b>accessibility_count</b> respectively.
```python
# Columns are inserted from right to left so the earlier positions stay valid.
itchio_df.insert(13, "accessibility_count", itchio_df.Accessibility.str[1:-1].str.split(", ").str.len().astype("Int64"))
itchio_df.insert(12, "inputs_count", itchio_df.Inputs.str[1:-1].str.split(", ").str.len().astype("Int64"))
itchio_df.insert(11, "languages_count", itchio_df.Languages.str[1:-1].str.split(", ").str.len().astype("Int64"))
itchio_df.insert(9, "tools_count", itchio_df.loc[:, "Made with"].str[1:-1].str.split(", ").str.len().astype("Int64"))
itchio_df.insert(3, "platforms_count", itchio_df.Platforms.str[1:-1].str.split(", ").str.len().astype("Int64"))
itchio_df
```

NaN values in these count columns can be replaced with 0.
```python
count_columns = ["platforms_count", "tools_count", "languages_count", "inputs_count", "accessibility_count"]
itchio_df.loc[:, count_columns] = itchio_df.loc[:, count_columns].fillna(0)
itchio_df
```

Finally, we have the summary of <b>itchio_df</b>. Unfortunately, there are many more NaN values in <b>itchio_df</b> than in <b>steam_df</b>, specifically in the <b>Made with</b>, <b>Average session</b>, <b>Languages</b>, <b>Inputs</b> and <b>Accessibility</b> columns, as these hold extra information that not every game page displays. As a result, we cannot simply drop the rows with NaN values.
```python
itchio_df.info()
```

### Q1: How popular are indie games compared to AAA games? <a id="Q1_EDA"></a>

We can examine the rise in popularity of indie and non-indie games by plotting the total concurrent players of each group against time. The total concurrent players in a month can be estimated by summing the average concurrent players of every game in that month. A rolling average is used to smooth the graph.
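The estimate described above can be sketched on toy numbers. All values here are made up, and the window of 2 months is only chosen to keep the example short.

```python
import pandas as pd

# Three hypothetical games observed over four months.
months = pd.date_range("2022-01-01", periods=4, freq="MS")
is_indie = pd.Series([True, True, False])
monthly_avg = pd.DataFrame(
    [[10.0, 20.0, 30.0, 40.0],
     [0.0, 10.0, 20.0, 30.0],
     [100.0, 100.0, 200.0, 200.0]],
    columns=months,
)

# Sum the per-game averages within each group to estimate the monthly totals.
indie_total = monthly_avg[is_indie].sum()
non_indie_total = monthly_avg[~is_indie].sum()
totals = pd.DataFrame({"Indie": indie_total, "Non-indie": non_indie_total})

# Rolling mean over 2 months; the first month has no full window and is dropped.
smoothed = totals.rolling(2).mean().dropna()
```

A longer window smooths the curve more, at the cost of losing more months at the start of the series.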
```python
# Estimate total concurrent players per month as the sum of each game's monthly average.
indie = avg_players_df.loc[avg_players_df.is_indie].iloc[:, 3:].sum()
non_indie = avg_players_df.loc[~avg_players_df.is_indie].iloc[:, 3:].sum()
total_players_df = pd.DataFrame({"Indie": indie, "Non-indie": non_indie})

# Smooth both series with a 6-month rolling mean before plotting.
total_players_df.rolling(6).mean().dropna().plot(
    kind="area", stacked=True, figsize=(20, 10),
    title="Total concurrent players from indie games and non-indie games against time",
    ylabel="Number of concurrent players", xlabel="Year",
)
plt.annotate(
    text="Non-indie games spike in 2018",
    xy=(datetime(2018, 3, 1), 3.8e6), xytext=(datetime(2018, 3, 1), 5e6),
    arrowprops={"arrowstyle": "->", "connectionstyle": "arc3", "lw": 3},
)
plt.show()
```

Both indie and non-indie games show a steady increase in the total number of concurrent players from 2013 to 2020. However, non-indie games had a spike in total concurrent players from late 2017 to early 2018, before returning to their usual rate of increase in late 2018. From 2020, growth in total concurrent players accelerated for both indie and non-indie games, with a slight amount of oscillation.

However, in order to see the rise in popularity of indie games relative to non-indie games, we have to plot the proportion of concurrent players from indie and non-indie games against time, rather than the total number of concurrent players.
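Converting totals into proportions amounts to dividing each month by its own total. A small sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical monthly totals for the two groups.
totals = pd.DataFrame(
    {"Indie": [100.0, 150.0, 300.0], "Non-indie": [900.0, 850.0, 700.0]},
    index=pd.to_datetime(["2013-01-01", "2017-01-01", "2022-01-01"]),
)

# Divide each row by its own total so that every row sums to 100%.
shares = totals.div(totals.sum(axis=1), axis=0) * 100
```

Because each row is normalised separately, the shares show which group is growing faster even when both totals are rising.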
```python
indie = avg_players_df.loc[avg_players_df.is_indie].iloc[:, 3:].sum()
non_indie = avg_players_df.loc[~avg_players_df.is_indie].iloc[:, 3:].sum()
total_players_df = pd.DataFrame({"Indie": indie, "Non-indie": non_indie})

# Convert the monthly totals into percentages of all concurrent players.
total = total_players_df.sum(axis=1)
total_players_df.loc[:, "Indie"] = total_players_df.loc[:, "Indie"] / total * 100
total_players_df.loc[:, "Non-indie"] = 100 - total_players_df.loc[:, "Indie"]

total_players_df.rolling(6).mean().dropna().plot(
    kind="area", stacked=True, figsize=(20, 10),
    title="Proportion of concurrent players from indie games and non-indie games against time",
    ylabel="Proportion of concurrent players", xlabel="Year",
)
plt.show()
```

The proportion of concurrent players from indie games increased steadily from around 12% in 2013 to around 22% in 2022. Since this proportion increased over time, we can infer that indie games have grown at a greater rate than non-indie games. There was also a small dip in 2018, which is explained by the spike in total concurrent players that non-indie games had.

We can also plot the proportion of concurrent players from indie and non-indie games against time for games with different popularity levels.
xxxxxxxxxxfor grp in ["Highest", "High", "Low", "Lowest"]: indie = avg_players_df.loc[avg_players_df.is_indie & (avg_players_df.owners_binned == grp)].iloc[:, 3:].sum() non_indie = avg_players_df.loc[~avg_players_df.is_indie & (avg_players_df.owners_binned == grp)].iloc[:, 3:].sum() total_players_df = pd.DataFrame({"Indie": indie, "Non-indie": non_indie}) total = total_players_df.sum(axis=1) total_players_df.loc[:, "Indie"] = total_players_df.loc[:, "Indie"]/total*100 total_players_df.loc[:, "Non-indie"] = 100-total_players_df.loc[:, "Indie"] pd.concat([total_players_df[["Indie"]].rolling(6).mean().dropna(), total_players_df[["Non-indie"]].rolling(6).mean().dropna()], axis=1).plot(kind="area", stacked=True, figsize=(20, 10), title="Proportion of concurrent players from indie games and non-indie games against time in the \""+grp+"\" bin", ylabel="Proportion of concurrent players", xlabel="Year") plt.show()xxxxxxxxxxThe proportion of concurrent players from indie games increased over time, regardless of the popularity level of the games. However, games that were less popular had a greater increase in the proportion of concurrent players from indie games over time.Therefore, we can infer that the popularity of indie games among gameers has been on the rise and is catching up to the popularity of non-indie games, especially for less popular games.The proportion of concurrent players from indie games increased over time, regardless of the popularity level of the games. However, games that were less popular had a greater increase in the proportion of concurrent players from indie games over time.
Therefore, we can infer that the popularity of indie games among gamers has been on the rise and is catching up to the popularity of non-indie games, especially for less popular games.
Next, we can plot the total number of indie and non-indie games released against time.
```python
# Count the games released on each date, split by indie and non-indie
date_freq = steam_df.groupby(["is_indie", "date"])[["id"]].count().reset_index()
date_freq = pd.pivot_table(date_freq, values="id", index=["date"], columns="is_indie").fillna(0)

# Turn the per-date counts into running totals
date_freq.loc[:, True] = date_freq.loc[:, True].cumsum()
date_freq.loc[:, False] = date_freq.loc[:, False].cumsum()

# Column order [1, 0] puts indie (True) first; the sort_columns argument is not needed
date_freq.iloc[:, [1, 0]].plot(kind="area", stacked=True, figsize=(20, 10),
                               title="Total number of indie and non-indie games released against time",
                               ylabel="Number of games", xlabel="Year")
plt.show()
```

Both indie and non-indie games show an increasing trend over time. However, the range of the x-axis is unsuitable, since it starts as early as 1970 and there is little increase in the early years, so we can zoom in on the period from 2000 onwards.
```python
# Same plot, restricted to 2000-2022 with xlim
date_freq.iloc[:, [1, 0]].plot(kind="area", figsize=(20, 10), stacked=True,
                               title="Total number of indie and non-indie games released against time",
                               ylabel="Number of games", xlabel="Year",
                               xlim=[datetime(2000, 1, 1), datetime(2022, 12, 31)])
plt.show()
```

After zooming in on the period from 2000 onwards, we can more clearly see the increasing trends of both indie and non-indie games. The total number of indie games grew exponentially from 2008 onwards, surpassing the total number of non-indie games in 2015. This exponential growth can be better visualised if we instead plot the proportion of indie and non-indie games released against time.
```python
# Convert the running totals into percentages of all games released so far
date_freq.loc[:, True] = date_freq.loc[:, True]/(date_freq.loc[:, True]+date_freq.loc[:, False])*100
date_freq.loc[:, False] = 100-date_freq.loc[:, True]

date_freq.iloc[:, [1, 0]].plot(kind="area", figsize=(20, 10), stacked=True,
                               title="Proportion of indie and non-indie games released against time",
                               ylabel="Proportion of games", xlabel="Year",
                               xlim=[datetime(2000, 1, 1), datetime(2022, 12, 31)])
plt.annotate(text="Start of exponential growth at 2008",
             xy=(datetime(2008, 1, 1), 15), xytext=(datetime(2008, 1, 1), 40),
             arrowprops={"arrowstyle": "->", "connectionstyle": "arc3", "lw": 3})
plt.show()
```

The proportion of indie games released increased from less than 10% in 2000 to around 75% in 2022. Here, we can clearly see the exponential growth from 2008 onwards, with the proportion of indie games released reaching 50% in 2015. Given the exponential growth in the number of indie games released from 2008 onwards, and how much faster indie games have grown relative to non-indie games, we can conclude that the prevalence of indie games, and likely the demand for them, started to increase rapidly from 2008 onwards.
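One way to check whether a trend is really exponential, rather than judging it by eye, is to fit a straight line to the logarithm of the counts. The sketch below is only an illustration on made-up cumulative counts (the 50% yearly growth is an assumption built into the toy data, not a result from `steam_df`).

```python
import numpy as np

# Hypothetical cumulative number of games released by each year (toy data)
years = np.arange(2008, 2016)
cumulative = 100 * 1.5 ** (years - 2008)  # constructed to grow by 50% every year

# Exponential growth becomes a straight line after taking logs,
# so a degree-1 polynomial fit recovers the growth rate from the slope
slope, intercept = np.polyfit(years - years[0], np.log(cumulative), 1)
annual_growth = np.exp(slope) - 1
print(f"Estimated annual growth rate: {annual_growth:.1%}")
```

On real data the points will not lie exactly on a line, so the residuals of the fit would indicate how well the exponential description holds.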
According to the graph, we can also see that currently, in 2022, around 75% of the games are indie games. We can confirm this with a pie chart.
```python
plt.figure(figsize=(7, 7))
plt.pie(
    # Reindex so the counts are always in the same order as the labels
    steam_df.is_indie.value_counts().reindex([True, False], fill_value=0),
    labels=["Indie", "Non-indie"],
    autopct='%1.1f%%',
)
plt.title("Proportion of indie games and non-indie games")
plt.show()
```

We can also plot pie charts of each of the bins, to show the proportion of indie and non-indie games at different popularity levels.
```python
for grp in ["Lowest", "Low", "High", "Highest"]:
    plt.figure(figsize=(7, 7))
    plt.pie(
        # Reindex so the counts are always in the same order as the labels
        steam_df.loc[avg_players_df.owners_binned == grp].is_indie
                .value_counts().reindex([True, False], fill_value=0),
        labels=["Indie", "Non-indie"],
        autopct='%1.1f%%'
    )
    plt.title("Proportion of indie games and non-indie games in the \""+grp+"\" bin")
    plt.show()
```

As the popularity of games increased, the proportion of indie games decreased, from 78.7% in the "Lowest" bin to only 40.3% in the "Highest" bin. This shows that even though indie games have a faster rate of growth than non-indie games, indie games are still not able to overtake non-indie games in terms of popularity, especially the biggest and most popular ones.
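The percentages shown in the pie charts can also be tabulated in one step. The following is a minimal sketch using `pd.crosstab` on a hypothetical toy frame (the rows are invented; only the column names `owners_binned` and `is_indie` mirror the ones used above).

```python
import pandas as pd

# Hypothetical games with a popularity bin and an indie flag (toy data)
games = pd.DataFrame({
    "owners_binned": ["Lowest"] * 4 + ["Highest"] * 5,
    "is_indie": [True, True, True, False, True, True, False, False, False],
})

# Readable labels instead of True/False
kind = games["is_indie"].map({True: "Indie", False: "Non-indie"})

# normalize="index" makes every row (bin) sum to 1, so * 100 gives percentages
share = pd.crosstab(games["owners_binned"], kind, normalize="index") * 100
print(share.round(1))
```

A table like this makes it easier to compare all bins at once than reading four separate pie charts.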
However, there is another way to compare the popularity of indie and non-indie games. By plotting boxplots of the total number of reviews of indie and non-indie games at different levels of popularity, we can infer whether indie games are comparable in size and popularity to non-indie games. We have to separate the different levels of popularity into different boxplots, because their y-axis ranges differ greatly.
```python
for grp in ["Highest", "High", "Low", "Lowest"]:
    plt.figure(figsize=(20, 10))
    bin_df = steam_df.loc[steam_df.owners_binned == grp]
    sns.boxplot(data=bin_df, y="total_reviews", x="is_indie", order=[True, False])
    plt.ylabel("Total number of reviews")
    plt.title("Distribution of the total number of reviews of indie games and non-indie games in the \""+grp+"\" bin")
    plt.show()
```

Unfortunately, because of the many outliers above the upper bound, the boxes are too compressed to read, so we need to hide the outliers.
```python
for grp in ["Highest", "High", "Low", "Lowest"]:
    plt.figure(figsize=(20, 10))
    bin_df = steam_df.loc[steam_df.owners_binned == grp]
    # showfliers=False hides the outliers so that the boxes are readable
    sns.boxplot(data=bin_df, y="total_reviews", x="is_indie", order=[True, False], showfliers=False)
    plt.ylabel("Total number of reviews")
    plt.title("Distribution of the total number of reviews of indie games and non-indie games in the \""+grp+"\" bin")
    plt.show()
```

In all 4 bins, the median of indie games is higher than the median of non-indie games. Both indie and non-indie games have distributions that are skewed to the right in all 4 bins. The IQR of indie games is larger than that of non-indie games in the "Highest" and "Low" bins, and smaller in the "High" and "Lowest" bins.
Since the median total number of reviews of indie games is consistently higher than that of non-indie games, we can infer that indie games are comparable in scale and popularity to non-indie games, regardless of the level of popularity.
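To make the terms used in these observations concrete, the sketch below computes the quantities a boxplot draws (median, IQR and the upper bound beyond which points count as outliers) on a made-up series of review counts. It assumes the common 1.5 × IQR whisker rule, which is also the default in seaborn and matplotlib.

```python
import pandas as pd

# Hypothetical review counts with one very large value (toy data)
reviews = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 500])

# Quartiles and interquartile range
q1, median, q3 = reviews.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points above Q3 + 1.5 * IQR are drawn as outliers ("fliers")
upper_bound = q3 + 1.5 * iqr
outliers = reviews[reviews > upper_bound]

print(f"median={median}, IQR={iqr}, upper bound={upper_bound}")
print("outliers:", outliers.tolist())
```

A single extreme value stretches the y-axis far beyond the box, which is why hiding the outliers with `showfliers=False` makes the medians and IQRs readable.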
### Q2: What are the major differences between indie games and AAA games? <a id="Q2_EDA"></a>

Firstly, we can compare the quality of indie and non-indie games by plotting the distribution of ratings of indie and non-indie games.
```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="rating", x="is_indie", order=[True, False])
plt.title("Distribution of rating of indie and non-indie games")
plt.ylabel("Rating")
plt.show()
```

Indie games have a higher median than non-indie games. Both indie and non-indie games have distributions that are skewed to the left. Indie games have a smaller IQR than non-indie games. Both indie and non-indie games have outliers below the lower bound.
This pattern is also consistent at different popularity levels.
```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="rating", x="owners_binned", order=["Highest", "High", "Low", "Lowest"],
            hue="is_indie", hue_order=[True, False])
plt.title("Distribution of rating of indie and non-indie games at different popularity levels")
plt.ylabel("Rating")
plt.show()
```

In all 4 bins, indie games have a higher median than non-indie games, both indie and non-indie games have distributions that are skewed to the left, indie games have a smaller IQR than non-indie games, and both indie and non-indie games have outliers below the lower bound.
It is also worth noting that indie and non-indie games in the "Highest" bin have the greatest difference in rating medians of the 4 bins. This suggests that as the level of popularity increases, the difference in quality between indie and non-indie games becomes larger, with the most popular indie games being much better received than non-indie games of around the same popularity.
Regardless of the level of popularity, indie games are overall more enjoyable and more positively received than non-indie games, as seen by the higher median of indie games. There is also less variation in quality in indie games than in non-indie games, shown by the smaller IQR of indie games.
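The median, IQR and direction of skew read off the boxplots can also be computed directly. The sketch below does this on hypothetical ratings (all values are invented, and the helper `iqr` is defined here only for illustration).

```python
import pandas as pd

# Hypothetical ratings for a few indie and non-indie games (toy data)
ratings = pd.DataFrame({
    "is_indie": [True] * 5 + [False] * 5,
    "rating": [70, 80, 85, 90, 95, 40, 60, 75, 85, 95],
})
ratings["kind"] = ratings["is_indie"].map({True: "Indie", False: "Non-indie"})

def iqr(values):
    """Interquartile range: distance between the 75th and 25th percentiles."""
    return values.quantile(0.75) - values.quantile(0.25)

# One row per group, one column per statistic
summary = ratings.groupby("kind")["rating"].agg(median="median", iqr=iqr, skew="skew")
print(summary.round(2))
```

A negative value in the `skew` column corresponds to a distribution that is skewed to the left, which matches how the boxplots above are described.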
Next, we can compare the length of indie and non-indie games by plotting the distribution of average playtime of indie and non-indie games.
```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="avg_playtime", x="is_indie", order=[True, False])
plt.title("Distribution of average playtime of indie and non-indie games")
plt.ylabel("Average Playtime")
plt.show()
```

There are many outliers above the upper bounds of indie and non-indie games. However, even if we hide the outliers, too many of the values are 0, so we are unable to get useful boxplots.
```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="avg_playtime", x="is_indie", order=[True, False], showfliers=False)
plt.title("Distribution of average playtime of indie and non-indie games")
plt.ylabel("Average Playtime")
plt.show()
```

If we plot the distribution of average playtime of indie and non-indie games by the level of popularity, then we can get useful boxplots.
```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="avg_playtime", x="owners_binned", order=["Highest", "High", "Low", "Lowest"],
            hue="is_indie", hue_order=[True, False], showfliers=False)
plt.title("Distribution of average playtime of indie and non-indie games")
plt.ylabel("Average Playtime")
plt.show()
```

Too many of the values are 0 in the "Lowest" bin, so unfortunately we cannot use it for any observations. The other bins, however, do allow us to observe some trends.
The median of indie games is lower than the median of non-indie games in the "Highest" and "High" bins, while the medians of both indie and non-indie games are 0 in the "Low" bin. Both indie and non-indie games have distributions that are skewed to the right in all 3 bins. Indie games have a smaller IQR than non-indie games in the "Highest" and "High" bins, and a larger IQR in the "Low" bin.
We can get similar results if we plot the distribution of median playtime as well.
```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="median_playtime", x="is_indie", order=[True, False])
plt.title("Distribution of median playtime of indie and non-indie games")
plt.ylabel("Median Playtime")
plt.show()
```

```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="median_playtime", x="is_indie", order=[True, False], showfliers=False)
plt.title("Distribution of median playtime of indie and non-indie games")
plt.ylabel("Median Playtime")
plt.show()
```

```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="median_playtime", x="owners_binned", order=["Highest", "High", "Low", "Lowest"],
            hue="is_indie", hue_order=[True, False], showfliers=False)
plt.title("Distribution of median playtime of indie and non-indie games")
plt.ylabel("Median Playtime")
plt.show()
```

From these graphs, we can conclude that indie games are generally shorter and have less content than non-indie games, as shown by their lower median playtime. This makes sense, since indie game developers have fewer resources and less manpower, and are unable to create a game as large in scale as a non-indie game.
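The reason some of the playtime boxplots collapse can be quantified by measuring how many values are exactly 0 in each bin. The sketch below uses hypothetical playtimes (invented numbers, with only the column names matching the ones used above).

```python
import pandas as pd

# Hypothetical average playtimes in two popularity bins (toy data)
playtime = pd.DataFrame({
    "owners_binned": ["Lowest"] * 5 + ["High"] * 5,
    "avg_playtime": [0, 0, 0, 0, 30, 0, 45, 120, 200, 600],
})

# Percentage of games in each bin whose recorded playtime is exactly 0
zero_share = playtime["avg_playtime"].eq(0).groupby(playtime["owners_binned"]).mean() * 100

# When more than half of a bin is 0, the median is 0 as well
medians = playtime.groupby("owners_binned")["avg_playtime"].median()

print(zero_share)
print(medians)
```

Once more than half of the values in a group are 0, the median sits at 0 and the box is flattened against the axis, so the boxplot carries little information for that group.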
We can also compare how expensive indie and non-indie games are by plotting the distribution of price of indie and non-indie games.
```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="price", x="is_indie", order=[True, False])
plt.title("Distribution of price of indie and non-indie games")
plt.ylabel("Price")
plt.show()
```

By hiding the outliers above the upper bounds of both indie and non-indie games, we get the following graph.
```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="price", x="is_indie", order=[True, False], showfliers=False)
plt.title("Distribution of price of indie and non-indie games")
plt.ylabel("Price")
plt.show()
```

The median of indie games is lower than the median of non-indie games. Both indie and non-indie games have a distribution that is skewed to the right. Indie games have a smaller IQR than non-indie games.
We get similar results if we separate the games by level of popularity.
```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="price", x="owners_binned", order=["Highest", "High", "Low", "Lowest"],
            hue="is_indie", hue_order=[True, False])
plt.title("Distribution of price of indie and non-indie games")
plt.ylabel("Price")
plt.show()
```

By hiding the outliers above the upper bounds of both indie and non-indie games, we get the following graph.
```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="price", x="owners_binned", order=["Highest", "High", "Low", "Lowest"],
            hue="is_indie", hue_order=[True, False], showfliers=False)
plt.title("Distribution of price of indie and non-indie games")
plt.ylabel("Price")
plt.show()
```

In all 4 bins, the median of indie games is lower than the median of non-indie games and both indie and non-indie games have a distribution that is skewed to the right. Indie games have a smaller IQR than non-indie games for all bins except for the "Lowest" bin, where the IQRs are equal.
It is also worth noting that as popularity increases, the difference between the medians of indie and non-indie games also increases, which suggests that the more popular non-indie games are much more expensive than indie games of around the same popularity.
From the graphs, we can conclude that overall, indie games are cheaper than non-indie games, as shown by indie games having a lower median of price. Non-indie games also have a more diverse range of prices, as shown by the higher IQR of non-indie games.
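The widening gap between the median prices can be expressed as a number per bin instead of being read off the boxplots. The following sketch shows one way to do this on hypothetical prices (the values are invented for illustration).

```python
import pandas as pd

# Hypothetical prices for indie and non-indie games in two bins (toy data)
prices = pd.DataFrame({
    "owners_binned": ["Highest"] * 6 + ["Lowest"] * 6,
    "is_indie": [True, True, True, False, False, False] * 2,
    "price": [10, 15, 20, 30, 40, 60, 3, 5, 7, 4, 6, 8],
})
prices["kind"] = prices["is_indie"].map({True: "Indie", False: "Non-indie"})

# Median price per bin and kind, with one column per kind
medians = prices.groupby(["owners_binned", "kind"])["price"].median().unstack("kind")

# Positive gap means non-indie games are more expensive in that bin
gap = medians["Non-indie"] - medians["Indie"]
print(medians)
print(gap)
```

Comparing the `gap` values across bins gives a direct check of whether the price difference grows with popularity.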
If a game is of higher quality and is made more accessible to players, its developers would provide more language options within the game. Therefore, we can compare the distribution of the number of languages of indie and non-indie games to compare how accessible indie and non-indie games are, as well as to give a general gauge of their quality.
```python
plt.figure(figsize=(20, 10))
# Use the is_indie column of steam_df directly as the boolean mask
sns.kdeplot(data=steam_df.loc[steam_df.is_indie], x="languages_count", label="Indie")
sns.kdeplot(data=steam_df.loc[~steam_df.is_indie], x="languages_count", label="Non-indie")
plt.legend()
plt.title("Distribution of the number of languages of indie and non-indie games")
plt.xlabel("Number of Languages")
plt.show()
```

The overall shapes of the distributions of indie and non-indie games are similar, with a peak at 1 language before the density decreases as the number of languages increases. It is also worth noting that there is a small spike in density between 25 and 30 languages, which most likely represents the biggest and most popular indie and non-indie games that have many language options.
However, there is a greater density of indie games with fewer than 5 languages compared to non-indie games, while non-indie games have a wider spread in the number of languages, with a greater density than indie games when the number of languages is 5 or greater.
Therefore, we can conclude that non-indie games generally have a greater number of language options than indie games, which suggests that indie games are less accessible than non-indie games. However, this does not necessarily mean that indie games are poorer in quality. It could simply be that indie developers, due to their lack of manpower, cannot find people fluent in different languages to provide translations, while non-indie games have many people working on them, some of whom are fluent in other languages.
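The comparison around the 5-language mark can be backed by a simple proportion, which is easier to state precisely than a difference in density curves. The sketch below computes the share of games with at least 5 languages on hypothetical data (the counts are invented).

```python
import pandas as pd

# Hypothetical language counts for a few games (toy data)
langs = pd.DataFrame({
    "is_indie": [True] * 6 + [False] * 4,
    "languages_count": [1, 1, 2, 3, 6, 1, 1, 5, 8, 12],
})
kind = langs["is_indie"].map({True: "Indie", False: "Non-indie"})

# Percentage of games in each group that support 5 or more languages
share_5_plus = langs["languages_count"].ge(5).groupby(kind).mean() * 100
print(share_5_plus.round(1))
```

The threshold of 5 follows the observation above; any other cut-off could be used in the same way.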
We can compare the amount of manpower working behind indie and non-indie games by plotting the distribution of the number of developers.
```python
# Count the games for each number of developers, split by indie and non-indie
developers_df = steam_df.groupby(["is_indie", "developers_count"])[["id"]].count()
developers_df = pd.pivot_table(developers_df, values="id", index=["developers_count"], columns="is_indie").fillna(0)
developers_df.iloc[:, [1, 0]].plot(kind="bar", figsize=(20, 10), xlabel="Number of Developers",
                                   ylabel="Number of Games",
                                   title="Number of indie and non-indie games by the number of developers")
plt.show()
```

However, since there are many more indie games than non-indie games in the dataset, comparing the raw counts is unfair and inaccurate. Therefore, it is more appropriate to plot the distribution of the proportion of indie and non-indie games by the number of developers.
```python
# Divide each count by the size of its own group to get percentages
developers_df.loc[:, True] = developers_df.loc[:, True]/len(steam_df.loc[steam_df.is_indie])*100
developers_df.loc[:, False] = developers_df.loc[:, False]/len(steam_df.loc[~steam_df.is_indie])*100
developers_df.iloc[:, [1, 0]].plot(kind="bar", figsize=(20, 10), xlabel="Number of Developers",
                                   ylabel="Proportion of Games",
                                   title="Proportion of indie and non-indie games by the number of developers")
plt.show()
```

Since the proportions of indie and non-indie games with 6 or more developers are very small, we can sum the proportions of indie and non-indie games with 5 developers and above into a single "5+" category.
```python
# Sum the rows for 5 or more developers and insert the result as a "5+" category
five_and_above = developers_df.iloc[4:].sum()
developers_df = developers_df.T
developers_df.insert(4, "5+", five_and_above)
developers_df.T.iloc[0:5, [1, 0]].plot(kind="bar", figsize=(20, 10), xlabel="Number of Developers",
                                       ylabel="Proportion of Games",
                                       title="Proportion of indie and non-indie games by the number of developers")
plt.show()
```

Both indie and non-indie games have the largest proportion of games with only 1 developer. However, the proportion of indie games with only 1 developer is larger than that of non-indie games, while for games with 2 or more developers, the proportion of non-indie games is larger than that of indie games. This shows that overall, non-indie games are more likely to have a greater number of developers.
We can get a similar trend by plotting the distribution of the number of publishers.
```python
# Count the games for each number of publishers, split by indie and non-indie
publishers_df = steam_df.groupby(["is_indie", "publishers_count"])[["id"]].count()
publishers_df = pd.pivot_table(publishers_df, values="id", index=["publishers_count"], columns="is_indie").fillna(0)
publishers_df.iloc[:, [1, 0]].plot(kind="bar", figsize=(20, 10), xlabel="Number of Publishers",
                                   ylabel="Number of Games",
                                   title="Number of indie and non-indie games by the number of publishers")
plt.show()
```

By plotting the distribution of the proportion of indie and non-indie games by the number of publishers, we get the following graph.
```python
# Divide each count by the size of its own group to get percentages
publishers_df.loc[:, True] = publishers_df.loc[:, True]/len(steam_df.loc[steam_df.is_indie])*100
publishers_df.loc[:, False] = publishers_df.loc[:, False]/len(steam_df.loc[~steam_df.is_indie])*100
publishers_df.iloc[:, [1, 0]].plot(kind="bar", figsize=(20, 10), xlabel="Number of Publishers",
                                   ylabel="Proportion of Games",
                                   title="Proportion of indie and non-indie games by the number of publishers")
plt.show()
```

Both indie and non-indie games have the largest proportion of games with only 1 publisher. However, the proportion of indie games with only 1 publisher is larger than that of non-indie games, while for games with 2 or more publishers, the proportion of non-indie games is larger than that of indie games. This shows that overall, non-indie games are more likely to have a greater number of publishers.
Therefore, indie games tend to have less manpower behind them than non-indie games, as shown by their smaller number of developers and publishers.
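The two steps used for developers and publishers, normalising by group size and lumping the rare high counts together, can also be sketched compactly. The example below uses `clip` and `pd.crosstab` on hypothetical data (the rows are invented, and capping at 5 is just one possible choice).

```python
import pandas as pd

# Hypothetical developer counts for a few games (toy data)
games = pd.DataFrame({
    "is_indie": [True] * 5 + [False] * 5,
    "developers_count": [1, 1, 1, 2, 7, 1, 2, 3, 5, 6],
})
kind = games["is_indie"].map({True: "Indie", False: "Non-indie"})

# Treat every count of 5 or more as the same category
capped = games["developers_count"].clip(upper=5)

# normalize="columns" divides by the size of each group, so columns sum to 100
share = pd.crosstab(capped, kind, normalize="columns") * 100
share = share.rename(index={5: "5+"})
print(share.round(1))
```

Normalising within each column is what makes the indie and non-indie bars comparable even though the two groups differ greatly in size.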
We can compare the most popular genres and tags of indie and non-indie games to find out what types of indie and non-indie games are being produced.
Firstly, we need to find the proportions of indie and non-indie games that are in each genre.
```python
# Genres are stored as strings such as "['Action', 'Indie']", so strip the quotes and
# brackets, split on ", " and count how often each genre appears within each group
genres_df = pd.DataFrame({
    True: pd.Series(
        steam_df.loc[steam_df.is_indie].genres.str.replace("'", "").str[1:-1]
        .str.split(", ", expand=True).stack().values
    ).value_counts()/len(steam_df.loc[steam_df.is_indie]),
    False: pd.Series(
        steam_df.loc[~steam_df.is_indie].genres.str.replace("'", "").str[1:-1]
        .str.split(", ", expand=True).stack().values
    ).value_counts()/len(steam_df.loc[~steam_df.is_indie]),
}).drop("Indie")*100  # the "Indie" genre itself is not informative here
genres_df = genres_df.sort_values(by=True, ascending=False)
genres_df
```

Beyond the top 11 genres, the proportions of both indie and non-indie games are very small (less than 0.1%). Therefore, we should only consider the top 11 genres.
```python
genres_df = genres_df[0:11]
genres_df
```

Finally, we can plot the proportions of indie and non-indie games that are in each of the top 11 genres.
```python
genres_df.plot(kind="bar", figsize=(20, 10), xlabel="Genre", ylabel="Proportion of Games",
               title="Proportions of indie and non-indie games in each of the top 11 genres")
plt.legend().set_title("is_indie")
plt.show()
```

The "Action", "Casual" and "Adventure" genres are the top 3 genres for both indie and non-indie games. However, a higher proportion of indie games than non-indie games falls in these top 3 genres.

Other than the top 3 genres, there is also a higher proportion of indie games in the "RPG" and "Early Access" genres, whereas the "Strategy", "Simulation", "Free to Play", "Sports", "Racing" and "Massively Multiplayer" genres have a higher proportion of non-indie games.

This graph shows that indie games are not as diverse in their genres as non-indie games, as a higher proportion of indie games is concentrated in the top 3 genres instead of being more evenly distributed. This may be due to limitations that indie games face but non-indie games do not, which restrict the genres of games that indie developers can produce. For example, games in the "Simulation", "Sports" and "Racing" genres might require a level of realism in graphics and gameplay that demands more resources and manpower than indie developers have. Games in the "Strategy" genre might require more complicated and in-depth game mechanics to keep players hooked, while games in the "Massively Multiplayer" genre require running servers to support multiplayer, both of which might be difficult for an individual to implement without prior knowledge and resources.

We can also find the proportions of indie and non-indie games that have each tag.
```python
# Same calculation as for genres, applied to tags.
# Tags that duplicate the top 11 genres (and "Indie") are dropped.
tags_df = pd.DataFrame({
    True: pd.Series(
        steam_df.loc[steam_df.is_indie].tags.str.replace("'", "").str[1:-1]
        .str.split(", ", expand=True).stack().values
    ).value_counts() / len(steam_df.loc[is_indie]),
    False: pd.Series(
        steam_df.loc[~steam_df.is_indie].tags.str.replace("'", "").str[1:-1]
        .str.split(", ", expand=True).stack().values
    ).value_counts() / len(steam_df.loc[~is_indie]),
}).drop(list(genres_df.index) + ["Indie"]) * 100
tags_df
```

Now, we can plot the top 20 tags with the highest proportion of indie or non-indie games.
```python
tags_df.sort_values(by=True, ascending=False)[0:20].plot(
    kind="bar", figsize=(20, 10), xlabel="Tags", ylabel="Proportion of Games",
    title="Top 20 tags with the highest proportion of indie games")
plt.show()

tags_df.sort_values(by=False, ascending=False)[0:20].plot(
    kind="bar", figsize=(20, 10), xlabel="Tags", ylabel="Proportion of Games",
    title="Top 20 tags with the highest proportion of non-indie games")
plt.show()
```

"Singleplayer" is by far the most popular tag for both indie and non-indie games, with the proportion of non-indie games with the "Singleplayer" tag being only slightly higher than that of indie games. Many other tags also appear in the top 20 tags for both indie and non-indie games: "Multiplayer", "2D", "3D", "Story Rich", "Atmospheric", "Puzzle", "Fantasy", "Anime", "Cute", "Colorful" and "Arcade".

Indie and non-indie games share 12 of their top 20 tags, so these graphs unfortunately do not tell us much about how the types of indie and non-indie games differ. Rather, they just show the tags that are popular overall.

Therefore, we instead need to find the top 20 tags with the greatest ratio of indie games to non-indie games, and vice versa.
```python
# Keep only tags that reach the upper quartile of proportions in at least one group,
# so that very rare tags do not dominate the ratios.
tags_top = tags_df.loc[(tags_df.iloc[:, 0] >= tags_df.quantile(0.75).iloc[0])
                       | (tags_df.iloc[:, 1] >= tags_df.quantile(0.75).iloc[1])]

(tags_top.iloc[:, 0] / tags_top.iloc[:, 1]).sort_values(ascending=False)[0:20].plot(
    kind="bar", figsize=(20, 10), xlabel="Tags", ylabel="Ratio",
    title="Top 20 tags with the greatest ratio of indie games to non-indie games")
plt.show()

(tags_top.iloc[:, 1] / tags_top.iloc[:, 0]).sort_values(ascending=False)[0:20].plot(
    kind="bar", figsize=(20, 10), color="orange", xlabel="Tags", ylabel="Ratio",
    title="Top 20 tags with the greatest ratio of non-indie games to indie games")
plt.show()
```

Now, we can finally see some patterns in the types of games being developed as indie and non-indie games.

For indie games, "Short" is the top tag by quite a margin. This makes sense, as indie game developers usually do not have the resources or manpower to create extremely long games with a lot of content. However, some of the other tags give us an idea of how indie game developers work around these constraints. For example, some indie games use procedural generation, which is an algorithmic process for generating gameplay content. This can keep the gameplay fresh and less repetitive without hand-crafted content, increasing the replay value of indie games. As it turns out, "Procedural Generation" and "Replay Value" are both in the top 20 tags for indie games. Roguelikes and roguelites are examples of games that use procedural generation, and both appear as the tags "Roguelike" and "Roguelite". Another way some indie games make gameplay more fun is by making it more difficult or fast-paced so that it takes time to master, which can explain the tags "Difficult" and "Fast-Paced". Such games include those tagged "Bullet Hell", "Top-Down Shooter" and "Shoot Em Up", which also appear in the top 20. Finally, platformers and puzzle games are quite popular among indie games, with the tags "Puzzle Platformer", "Platformer", "Logic" and "Puzzle" all appearing in the top 20. This may be because puzzle games and platformers usually have simpler gameplay than other types of games.

On the other hand, for non-indie games, "Classic" is the top tag by quite a margin. This may be because many non-indie games are seen as classics or feature recognisable characters. There are also some types of games that are more complicated and require more resources and manpower. The "Historical", "Military", "War" and "Driving" tags describe games that have to be as realistic as possible, making them complex to develop. The "RTS", "JRPG", "Turn-Based Strategy" and "Tactical" tags describe games that need in-depth strategy and enough balancing to create interesting gameplay. Games with the "Open World" tag are usually very large in scale, as the world has to be large enough to incentivise players to explore it. Lastly, the "Multiplayer", "Online Co-Op", "VR", "PvP" and "Co-op" tags describe games that depend on external infrastructure or hardware, such as servers and VR headsets, in order to run.

Therefore, these graphs and tags show that some patterns emerge among indie games in response to the lack of resources and manpower, such as making gameplay more unique or interesting, and sticking to types of games that are easier to develop.
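The ratio ranking behind the two bar charts above can be sketched in plain Python so that each step is visible. This is only an illustration on made-up rows: it assumes the tag strings are Python list literals (parsed here with `ast.literal_eval` rather than the notebook's string slicing), and the `rows`, `tag_shares`, `upper_quartile` and `ratio_ranking` names are hypothetical.

```python
import ast
from collections import Counter
from statistics import quantiles

# Hypothetical rows: (is_indie, tags stored as a list-like string).
rows = [
    (True, "['Short', 'Puzzle']"),
    (True, "['Short', 'Roguelike']"),
    (True, "['Puzzle', 'Singleplayer']"),
    (True, "['Short', 'Singleplayer']"),
    (False, "['Classic', 'Singleplayer']"),
    (False, "['Classic', 'Open World']"),
    (False, "['Puzzle', 'Singleplayer']"),
    (False, "['Short', 'Classic']"),
]

def tag_shares(rows, indie):
    """Percentage of games in one group that carry each tag."""
    group = [ast.literal_eval(tags) for is_indie, tags in rows if is_indie == indie]
    counts = Counter(tag for tags in group for tag in set(tags))
    return {tag: 100 * n / len(group) for tag, n in counts.items()}

def upper_quartile(values):
    # The "inclusive" method interpolates linearly between data points.
    return quantiles(values, n=4, method="inclusive")[2]

def ratio_ranking(numer, denom, top=20):
    """Rank frequently used tags by (numer share) / (denom share)."""
    q_n = upper_quartile(list(numer.values()))
    q_d = upper_quartile(list(denom.values()))
    # Tags missing from either group are skipped, since their ratio is undefined.
    frequent = [t for t in numer.keys() & denom.keys()
                if numer[t] >= q_n or denom[t] >= q_d]
    ratios = {t: numer[t] / denom[t] for t in frequent}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)[:top]

indie = tag_shares(rows, True)
non_indie = tag_shares(rows, False)
indie_vs_non = ratio_ranking(indie, non_indie)
non_vs_indie = ratio_ranking(non_indie, indie)
print(indie_vs_non)
print(non_vs_indie)
```

The upper-quartile filter matters because a tag carried by only a handful of games can produce an extreme ratio from very little evidence.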
### Q3: What factors contribute to the success of an indie game? <a id="Q3_EDA"></a>

First, we need to select only the indie games in the data set.
```python
indie_df = steam_df.loc[is_indie]
indie_df
```

We can compare the quality of indie games of different popularities by plotting the distribution of ratings for each popularity bin.
```python
plt.figure(figsize=(20, 10))
# Use indie_df so that only indie games are plotted, as the title states.
sns.boxplot(data=indie_df, y="rating", x="owners_binned", order=["Highest", "High", "Low", "Lowest"])
plt.title("Distribution of rating of indie games of different popularities")
plt.ylabel("Rating")
plt.xlabel("Bins")
plt.show()
```

All 4 bins have left-skewed distributions with outliers below the lower bound. As popularity increases, the median increases while the IQR decreases.

Therefore, more popular indie games tend to be more enjoyable and better received, as shown by their higher median ratings, and more consistent in quality, as shown by their lower IQRs.
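The medians and IQRs discussed above can also be computed directly rather than read off the boxplot. A minimal sketch with made-up ratings for two bins (the `ratings` data and the `box_summary` helper are hypothetical, and the quartiles use linear interpolation between data points):

```python
from statistics import median, quantiles

# Hypothetical ratings (in percent) for two popularity bins.
ratings = {
    "Highest": [70, 80, 85, 88, 90, 92, 95],
    "Lowest": [20, 45, 60, 70, 80, 90, 100],
}

def box_summary(values):
    """Median and interquartile range, the two numbers read off each box."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"median": median(values), "iqr": q3 - q1}

summary = {name: box_summary(vals) for name, vals in ratings.items()}
for name, stats in summary.items():
    print(name, stats)
```

In this toy data the "Highest" bin has both a higher median and a narrower IQR, which is the pattern that would support the conclusion that popular games are rated more consistently.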
Next, we can compare the length of indie games of different popularities by plotting the distribution of average playtime for each popularity bin.
```python
plt.figure(figsize=(20, 10))
sns.boxplot(data=indie_df, y="avg_playtime", x="owners_binned", order=["Highest", "High", "Low", "Lowest"])
plt.title("Distribution of average playtime of indie games of different popularities")
plt.ylabel("Average Playtime")
plt.xlabel("Bins")
plt.show()
```

By hiding the outliers above the upper bounds of all bins, we get the following graph.
```python
plt.figure(figsize=(20, 10))
# showfliers=False hides the outlier points so that the boxes are readable.
sns.boxplot(data=indie_df, y="avg_playtime", x="owners_binned", order=["Highest", "High", "Low", "Lowest"],
            showfliers=False)
plt.title("Distribution of average playtime of indie games of different popularities")
plt.ylabel("Average Playtime")
plt.xlabel("Bins")
plt.show()
```

As popularity decreases, both the median and the IQR decrease. All 4 bins have right-skewed distributions.

Similar results are shown by plotting the distribution of median playtime of indie games of different popularities.
```python
# Plot once with outliers shown and once with them hidden.
for showfliers in (True, False):
    plt.figure(figsize=(20, 10))
    sns.boxplot(data=indie_df, y="median_playtime", x="owners_binned",
                order=["Highest", "High", "Low", "Lowest"], showfliers=showfliers)
    plt.title("Distribution of median playtime of indie games of different popularities")
    plt.ylabel("Median Playtime")
    plt.xlabel("Bins")
    plt.show()
```

Therefore, more popular indie games tend to be longer and have more content, as shown by their higher median playtimes, which suggests that they are larger in scale.
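The points hidden by `showfliers=False` are those lying beyond the whiskers, which by default extend 1.5 times the IQR past the quartiles. A minimal sketch of that rule on made-up playtimes (the `playtimes` list and the `tukey_fences` helper are hypothetical):

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Lower and upper limits under the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical average playtimes in minutes; one game is played far longer.
playtimes = [10, 20, 30, 40, 50, 60, 1000]
low, high = tukey_fences(playtimes)
fliers = [v for v in playtimes if v < low or v > high]
print(low, high, fliers)
```

A single very long playtime like this stretches the y-axis and compresses the boxes, which is why the plots are repeated without outliers.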
We can also compare how expensive indie games of different popularities are by plotting the distribution of price for each popularity bin.
```python
plt.figure(figsize=(20, 10))
# Use indie_df so that only indie games are plotted, as the title states.
sns.boxplot(data=indie_df, y="price", x="owners_binned", order=["Highest", "High", "Low", "Lowest"])
plt.title("Distribution of price of indie games of different popularities")
plt.ylabel("Price")
plt.xlabel("Bins")
plt.show()
```

By hiding the outliers above the upper bounds of all bins, we get the following graph.
```python
plt.figure(figsize=(20, 10))
# Use indie_df so that only indie games are plotted, as the title states.
sns.boxplot(data=indie_df, y="price", x="owners_binned", showfliers=False,
            order=["Highest", "High", "Low", "Lowest"])
plt.title("Distribution of price of indie games of different popularities")
plt.ylabel("Price")
plt.xlabel("Bins")
plt.show()
```

As popularity increases, both the median and the IQR increase. All 4 bins have right-skewed distributions.

Therefore, more popular indie games tend to be more expensive, as shown by their higher median prices.
Developers who aim for higher quality and want their game to be accessible to more players tend to provide more language options. Therefore, we can compare the distribution of the number of languages of indie games of different popularities to see how accessible these games are, as well as to give a general gauge of their quality.
```python
plt.figure(figsize=(20, 10))
# Draw one density curve per popularity bin.
for bin_name in ["Highest", "High", "Low", "Lowest"]:
    sns.kdeplot(data=indie_df.loc[indie_df.owners_binned == bin_name], x="languages_count", label=bin_name)
plt.title("Distribution of the number of languages of indie games of different popularities")
plt.xlabel("Number of Languages")
plt.legend()
plt.show()
```

As popularity increases, the density of indie games with fewer than 5 languages decreases, while the density of indie games with 5 or more languages increases. The "High", "Low" and "Lowest" bins have similar shapes, peaking at 1 language and then decreasing in density as the number of languages increases. The "Highest" bin has a completely different shape: it is much more spread out, with a maximum density at around 10 languages.

Therefore, we can conclude that overall, as popularity increases, the number of languages increases. This suggests that more popular indie games are of higher quality and are more accessible to players.
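Since a KDE smooths integer counts into a continuous curve, the "fewer than 5" versus "5 or more" comparison can also be checked with exact shares. A minimal sketch on made-up data (the `games` list and the `share_at_least` helper are hypothetical; the threshold of 5 is the one used in the text):

```python
# Hypothetical (bin, number of supported languages) pairs.
games = [
    ("Highest", 12), ("Highest", 8), ("Highest", 1), ("Highest", 10),
    ("Lowest", 1), ("Lowest", 1), ("Lowest", 2), ("Lowest", 6),
]

def share_at_least(records, bin_name, threshold=5):
    """Percentage of games in a bin with at least `threshold` languages."""
    counts = [n for b, n in records if b == bin_name]
    return 100 * sum(n >= threshold for n in counts) / len(counts)

highest_share = share_at_least(games, "Highest")
lowest_share = share_at_least(games, "Lowest")
print(highest_share, lowest_share)
```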
We can compare the most popular genres and tags of indie games of different popularities to find out what types of indie games are being produced.

First, we need to find the proportions of indie games of different popularities that are in each of the same top 11 genres as before.
```python
popularity_bins = ["Highest", "High", "Low", "Lowest"]

# For each bin, count the genres and express them as a percentage of that bin's indie games.
genres_df = pd.DataFrame({
    bin_name: pd.Series(
        indie_df.loc[indie_df.owners_binned == bin_name].genres.str.replace("'", "")
        .str[1:-1].str.split(", ", expand=True).stack().values
    ).value_counts() / len(indie_df.loc[indie_df.owners_binned == bin_name])
    for bin_name in popularity_bins
}).drop("Indie") * 100

# Keep the 11 genres with the largest total proportion across the bins.
genres_df = genres_df.loc[genres_df.sum(axis=1).sort_values(ascending=False).index[0:11]]
genres_df
```

We can then plot a heatmap showing the proportions of indie games of different popularities that are in each of the top 11 genres.
```python
plt.figure(figsize=(20, 10))
sns.heatmap(genres_df, cmap="viridis", annot=True)
plt.title("Proportions of indie games of different popularities that are in each of the top 11 genres")
plt.xlabel("Bins")
plt.ylabel("Genres")
plt.show()
```

The "Action" genre has the highest proportion of games in all 4 bins. As popularity increases, the "Action", "Strategy", "Simulation", "RPG", "Free to Play" and "Massively Multiplayer" genres show an increasing trend, the "Casual", "Early Access", "Sports" and "Racing" genres show a decreasing trend, and the "Adventure" genre shows no obvious trend.

Unfortunately, because the results are so mixed, no obvious pattern emerges, unlike when indie and non-indie games were compared.

We can also compare the median of ratings at different popularity levels for each of the top 11 genres.
```python
# Overwrite each cell with the median rating of indie games in that genre and bin.
for genre in genres_df.index:
    for col in genres_df.columns:
        genres_df.loc[genre, col] = indie_df.loc[
            (indie_df.genres.str.contains(genre)) & (indie_df.owners_binned == col)
        ].rating.median()

# Order the genres by their mean across the bins.
genres_df = genres_df.loc[genres_df.mean(axis=1).sort_values(ascending=False).index]
genres_df
```

We can now plot a heatmap showing the median of ratings of indie games at different popularity levels for each of the top 11 genres.
```python
plt.figure(figsize=(20, 10))
sns.heatmap(genres_df, cmap="viridis", annot=True)
plt.title("Median of ratings of indie games at different popularity levels for each of the top 11 genres")
plt.xlabel("Bins")
plt.ylabel("Genres")
plt.show()
```

Few new patterns emerge, as the main pattern of the median increasing with popularity holds for all genres. However, there are two exceptions: the "Free to Play" genre has a significantly lower median for the "Highest" bin, and the "Massively Multiplayer" genre has a significantly lower median for all 4 bins. This could be because very popular free games are not as high in quality as other games of similar popularity, and because massively multiplayer games are too difficult for indie developers to develop successfully and effectively.

Finally, we can compare the median of prices at different popularity levels for each of the top 11 genres.
```python
# Overwrite each cell with the median price of indie games in that genre and bin.
for genre in genres_df.index:
    for col in genres_df.columns:
        genres_df.loc[genre, col] = indie_df.loc[
            (indie_df.genres.str.contains(genre)) & (indie_df.owners_binned == col)
        ].price.median()

# Order the genres by their mean across the bins.
genres_df = genres_df.loc[genres_df.mean(axis=1).sort_values(ascending=False).index]
genres_df
```

We can now plot a heatmap showing the median of prices of indie games at different popularity levels for each of the top 11 genres.
```python
plt.figure(figsize=(20, 10))
sns.heatmap(genres_df, cmap="viridis", annot=True)
plt.title("Median of prices of indie games at different popularity levels for each of the top 11 genres")
plt.xlabel("Bins")
plt.ylabel("Genres")
plt.show()
```

Similarly, few new patterns emerge, as the main pattern of the median increasing with popularity holds for all but 4 genres. These are the "Sports" and "Massively Multiplayer" genres, which have a decreasing trend, and the "Casual" and "Free to Play" genres, which have no clear trend.

Unfortunately, the genres do not give us any new insight, unlike when comparing indie and non-indie games.
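The two heatmaps above rest on one calculation: a median for every (genre, bin) pair. The sketch below shows it in plain Python on made-up rows. It assumes the genre strings are Python list literals and matches each genre as a whole list entry, whereas the loops above use substring matching through `str.contains`; the `rows` data and the `median_by_genre_and_bin` helper are hypothetical.

```python
import ast
from collections import defaultdict
from statistics import median

# Hypothetical rows: (genres as a list-like string, popularity bin, price).
rows = [
    ("['Action', 'Indie']", "Highest", 14.99),
    ("['Action', 'RPG', 'Indie']", "Highest", 19.99),
    ("['Action', 'Indie']", "Lowest", 4.99),
    ("['RPG', 'Indie']", "Lowest", 2.99),
    ("['RPG', 'Indie']", "Lowest", 6.99),
]

def median_by_genre_and_bin(rows, skip=("Indie",)):
    """Median value for every (genre, bin) pair, matching genres exactly."""
    grouped = defaultdict(list)
    for genres, bin_name, value in rows:
        for genre in ast.literal_eval(genres):
            if genre not in skip:
                grouped[(genre, bin_name)].append(value)
    return {key: median(vals) for key, vals in grouped.items()}

medians = median_by_genre_and_bin(rows)
for key in sorted(medians):
    print(key, round(medians[key], 2))
```

A game with several genres contributes to each of them, so the cells of the heatmap are not independent samples.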
Next, we can find the proportions of indie games of different popularities that have each tag.
```python
popularity_bins = ["Highest", "High", "Low", "Lowest"]

# For each bin, count the tags and express them as a percentage of that bin's indie games.
# Tags that duplicate the top 11 genres (and "Indie") are dropped.
tags_df = pd.DataFrame({
    bin_name: pd.Series(
        indie_df.loc[indie_df.owners_binned == bin_name].tags.str.replace("'", "")
        .str[1:-1].str.split(", ", expand=True).stack().values
    ).value_counts() / len(indie_df.loc[indie_df.owners_binned == bin_name])
    for bin_name in popularity_bins
}).drop(list(genres_df.index) + ["Indie"]) * 100
tags_df
```

Now, we can plot a heatmap of the top 20 tags with the highest proportion of indie games of different popularities.
```python
# One heatmap per bin, each sorted by that bin's proportions.
for bin_name in ["Highest", "High", "Low", "Lowest"]:
    plt.figure(figsize=(20, 20))
    sns.heatmap(tags_df.sort_values(by=bin_name, ascending=False)[0:20], cmap="viridis", annot=True)
    plt.title(f'Top 20 tags with the highest proportion of indie games in the "{bin_name}" bin')
    plt.xlabel("Bins")
    plt.ylabel("Tags")
    plt.show()
```

"Singleplayer" is again the most popular tag for indie games in all 4 bins, with its proportion increasing as popularity increases.

However, as when we compared indie and non-indie games, not many observations can be made, because many of the same tags appear in the top 20 tags of most or all 4 bins. These tags are simply the most popular overall.

Therefore, for each of the 4 bins, we instead need to find the top 20 tags with the greatest ratio of games from that specific bin to the sum of games from the other bins.
```python
# Keep only tags that are in the top quartile (by number of games) of at least one bin
q75 = tags_df.quantile(0.75)
tags_top = tags_df.loc[
    (tags_df.iloc[:, 0] >= q75.iloc[0])
    | (tags_df.iloc[:, 1] >= q75.iloc[1])
    | (tags_df.iloc[:, 2] >= q75.iloc[2])
    | (tags_df.iloc[:, 3] >= q75.iloc[3])
]

# For each bin, plot the 20 tags with the greatest ratio of games in that bin
# to games in the other three bins
bin_names = ["Highest", "High", "Low", "Lowest"]
bin_colors = ["C0", "orange", "green", "red"]
for i, (name, color) in enumerate(zip(bin_names, bin_colors)):
    other_bins = [j for j in range(4) if j != i]
    ratio = tags_top.iloc[:, i] / tags_top.iloc[:, other_bins].sum(axis=1)
    ratio.sort_values(ascending=False)[0:20].plot(
        kind="bar", figsize=(20, 10), color=color, xlabel="Tags", ylabel="Ratio",
        title=f"Top 20 tags with the greatest ratio of indie games from the \"{name}\" bin to indie games in other bins")
    plt.show()
```

There are some patterns that emerge among games of different popularities.
In the "Highest" bin, the tags point to games that require more manpower and resources to develop. Many of them relate to multiplayer games, which need servers to run and can be difficult to set up without prior knowledge or resources: "Online Co-Op", "Co-Op", "Competitive", "Multiplayer" and "PvP". Open world and sandbox games, which need a much larger world than other types of games so that players have room to explore and be creative, are also more prevalent among indie games in this bin, as seen in the "Open World Survival Craft", "Sandbox" and "Open World" tags. This is somewhat similar to the pattern seen for non-indie games when we compared indie and non-indie games.
On the other hand, in the "Lowest" bin, platformers were more common, as seen in the "2D Platformer", "Precision Platformer" and "3D Platformer" tags. This could be because platformers have simpler gameplay that is easier to develop than other games, similar to the pattern seen for indie games when we compared indie and non-indie games.
The other two bins showed no comparable trend; certain types of games were simply more common in each, for no apparent reason. In the "High" bin, turn-based games ("Turn-Based", "JRPG", "Turn-Based Combat", "Turn-Based Tactics", "Turn-Based Strategy") were more common. In the "Low" bin, RPGs ("RPG Maker", "JRPG", "Turn-Based Tactics", "Turn-Based Combat") and interactive fiction games ("Visual Novel", "Interactive Fiction", "Multiple Endings", "Choose Your Own Adventure") were more common.
Finally, we can find how the rating, playtime, price, number of languages and number of developers of a game affect its popularity. To do this, we can find the correlation coefficient and P-value that rating, average playtime, price, number of languages and number of developers each have with the total number of reviews.
```python
columns = ["rating", "avg_playtime", "price", "languages_count", "developers_count"]
clean_df = indie_df.dropna()  # drop incomplete rows once instead of once per column

corr_dict = {}
for col in columns:
    pearson_coef, p_value = stats.pearsonr(clean_df[col], clean_df["total_reviews"])
    corr_dict[col] = [pearson_coef, p_value]

pd.DataFrame(corr_dict, index=["pearson_coef", "p_value"]).T
```

The average playtime had the highest correlation coefficient, followed by number of languages, price, rating and finally number of developers. While the P-values of average playtime, number of languages, price and rating were very small, the P-value of the number of developers was not.
Since the number of developers had a low correlation coefficient and a high P-value, we find no evidence of a linear correlation between the number of developers and the total number of reviews, so it does not appear to contribute to the popularity of a game. On the other hand, average playtime has the strongest correlation with the total number of reviews and appears to contribute the most to the popularity of a game, followed by number of languages, price and finally rating.
We can also find the correlation coefficient that rating, average playtime, price and number of languages have with the total number of reviews at different levels of popularity.
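To make the per-bin idea concrete, here is a small standard-library sketch that computes the Pearson coefficient from its definition and applies it separately to each group. The helper names (`pearson_r`, `pearson_by_group`) and the toy rows are hypothetical and only illustrate the technique; the sketch does not compute P-values.

```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pearson_by_group(rows):
    """rows: iterable of (group, x, y) tuples. Returns {group: r}."""
    grouped = defaultdict(lambda: ([], []))
    for group, x, y in rows:
        grouped[group][0].append(x)
        grouped[group][1].append(y)
    return {group: pearson_r(xs, ys) for group, (xs, ys) in grouped.items()}


# Hypothetical toy rows: (bin, price, total_reviews)
toy_rows = [
    ("Highest", 5, 900), ("Highest", 10, 700), ("Highest", 20, 300),
    ("Low", 1, 10), ("Low", 2, 20), ("Low", 4, 40),
]
per_bin_r = pearson_by_group(toy_rows)
print(per_bin_r)
```

In this toy data the two groups have opposite signs, which is the kind of difference a single pooled coefficient would hide.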
```python
columns = ["rating", "avg_playtime", "price", "languages_count"]
bins = ["Highest", "High", "Low", "Lowest"]

pearson_coef_dict = {}
for col in columns:
    pearson_coef_row = []
    for b in bins:
        # Restrict to one popularity bin and drop incomplete rows
        bin_df = indie_df.loc[indie_df.owners_binned == b].dropna()
        pearson_coef, p_value = stats.pearsonr(bin_df[col], bin_df["total_reviews"])
        pearson_coef_row.append(pearson_coef)
    pearson_coef_dict[col] = pearson_coef_row

pd.DataFrame(pearson_coef_dict, index=bins)
```

We can now plot a graph of the correlation coefficient that rating, average playtime, price and number of languages have with the total number of reviews at different levels of popularity, to show how the correlation coefficient changes as popularity changes.
```python
pd.DataFrame(pearson_coef_dict, index=bins).plot(
    kind="line", figsize=(20, 10), xlabel="Bins", ylabel="Correlation Coefficient",
    title="Correlation coefficient that rating, average playtime, price, number of languages have with the total number of reviews at different levels of popularity")
plt.legend(["Rating", "Average Playtime", "Price", "Number of Languages"])
plt.show()
```

In the "Lowest" bin, the correlation coefficients of all 4 variables were low, showing that the correlation of these variables might not be as strong with the least popular games.
From the "Highest" bin to the "Low" bin, the correlation coefficient of average playtime decreased, that of price increased, and those of rating and number of languages showed no obvious trend. Average playtime therefore had a stronger correlation with the total number of reviews, and appears to contribute more to the popularity of a game, for games that are more popular. This is possibly because, when every game is high in quality, a successful game needs more content and a larger scale to stand out against its competition. On the other hand, price had a stronger correlation with the total number of reviews, and appears to contribute more to the popularity of a game, for games that are less popular. This is possibly because, if a game is lacking in content and quality, as many less popular games are, price becomes the biggest factor when players decide which game to purchase, whereas if a game has more content and is higher in quality, as many more popular games are, any reasonable price would be acceptable. Finally, rating and the number of languages had a roughly constant correlation with the total number of reviews, so the popularity of a game does not appear to affect how much they contribute to it. This may be because the quality and accessibility of a game are useful factors in creating a successful game, regardless of how popular the game is.
We can check this by training linear regression models and seeing whether they fit the data. The models are trained in groups of 20, with random_state ranging from 0 to 19, and the best models are picked by hand.
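As a complement to picking one split by hand, the spread of test scores across all 20 seeds can be summarised. The sketch below is only an illustration under stated assumptions: it uses synthetic data, swaps in NumPy's least-squares solver for scikit-learn's `LinearRegression`, and shuffles with its own random generator, so its seeds do not reproduce the splits of `train_test_split`. The helper name `r2_over_seeds` is hypothetical.

```python
import numpy as np


def r2_over_seeds(X, y, seeds=range(20), test_size=0.2):
    """Fit ordinary least squares on one random split per seed; return test R^2 values."""
    n = len(y)
    n_test = int(round(n * test_size))
    scores = []
    for seed in seeds:
        order = np.random.default_rng(seed).permutation(n)
        test_idx, train_idx = order[:n_test], order[n_test:]
        # Prepend a column of ones so the model has an intercept.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        residual = y[test_idx] - A_test @ beta
        ss_res = np.sum(residual ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1 - ss_res / ss_tot)
    return np.array(scores)


# Synthetic stand-in for the four features and the review counts.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, 2.0, -0.5, 0.0]) + rng.normal(scale=0.5, size=200)

scores = r2_over_seeds(X_demo, y_demo)
print(f"mean test R^2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the mean and spread alongside the chosen split shows how much the conclusions depend on which split was picked.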
```python
clean_df = indie_df.dropna()
features = ["rating", "avg_playtime", "price", "languages_count"]
x_train, x_test, y_train, y_test = train_test_split(
    clean_df[features], clean_df["total_reviews"], test_size=0.2, random_state=7)
lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

# Distribution of actual against predicted review counts
plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model")
plt.xlabel("Total Number of Reviews")
plt.text(x=150000, y=0.0003, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=150000, y=0.000275, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=150000, y=0.00025, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=150000, y=0.000225, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.show()

# Predicted against actual values; the red line marks perfect predictions
plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0,500000], [0,500000], color="red")
plt.title("Actual values against values from linear regression model")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
```

We can better see the accuracy of the linear regression model by adjusting the axes.
```python
clean_df = indie_df.dropna()
features = ["rating", "avg_playtime", "price", "languages_count"]
x_train, x_test, y_train, y_test = train_test_split(
    clean_df[features], clean_df["total_reviews"], test_size=0.2, random_state=7)
lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

# Same kdeplot as before, zoomed in on the bulk of the data
plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model")
plt.xlabel("Total Number of Reviews")
plt.xlim([-10000, 100000])
plt.text(x=70000, y=0.0003, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=70000, y=0.000275, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=70000, y=0.00025, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=70000, y=0.000225, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.show()

# Same scatterplot as before, zoomed in on the bulk of the data
plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0,45000], [0,45000], color="red")
plt.xlim([-1000, 50000])
plt.ylim([-1000, 50000])
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
```

According to the kdeplot, this linear regression model represents the data somewhat accurately, as the shape of the fitted distribution is similar to the shape of the actual distribution. However, the linear regression model becomes less accurate for games with a higher number of total reviews, as seen in both the kdeplot and the scatterplot.
The number of languages had the largest coefficient and contributes the most to the popularity of a game, followed by average playtime, price and lastly rating.
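When ranking variables by coefficient size, it may help to keep in mind that a raw coefficient depends on the units of its variable. The sketch below, on synthetic data with hypothetical variable names and made-up effect sizes, fits ordinary least squares with NumPy before and after standardising the features to show how the ranking can change; it is an illustration of the idea rather than a re-analysis of the dataset.

```python
import numpy as np


def ols_coefficients(X, y):
    """Least-squares slope coefficients (the intercept is fitted, then dropped)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]


rng = np.random.default_rng(1)
n = 500
playtime_minutes = rng.normal(600, 200, size=n)  # large numeric scale
languages = rng.normal(5, 2, size=n)             # small numeric scale
reviews = 2.0 * playtime_minutes + 50.0 * languages + rng.normal(0, 10, size=n)

X = np.column_stack([playtime_minutes, languages])
raw = ols_coefficients(X, reviews)

# Standardise each feature to mean 0 and standard deviation 1 before fitting.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = ols_coefficients(X_std, reviews)

print("raw coefficients:         ", raw)
print("standardized coefficients:", standardized)
```

Here the languages variable has the larger raw coefficient, but after standardising, playtime has the larger one, because a one-standard-deviation change in playtime moves the prediction further.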
```python
bin_df = indie_df.loc[indie_df.owners_binned == "Highest"].dropna()
features = ["rating", "avg_playtime", "price", "languages_count"]
x_train, x_test, y_train, y_test = train_test_split(
    bin_df[features], bin_df["total_reviews"], test_size=0.2, random_state=4)
lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

# Distribution of actual against predicted review counts
plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model in the \"Highest\" bin")
plt.xlabel("Total Number of Reviews")
plt.text(x=450000, y=7e-6, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=450000, y=6.5e-6, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=450000, y=6e-6, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=450000, y=5.5e-6, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.show()

# Predicted against actual values; the red line marks perfect predictions
plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0,500000], [0,500000], color="red")
plt.title("Actual values against values from linear regression model in the \"Highest\" bin")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
```

According to the kdeplot, this linear regression model represents the data somewhat accurately, as the shape of the fitted distribution is similar to the shape of the actual distribution.
Interestingly, the coefficient of price is negative, compared to the other coefficients that are positive. This makes sense, as the lower the price, the more popular and successful the game would be. The average playtime had the largest magnitude of coefficient and contributes the most to the popularity of a game, followed by number of languages, rating, and lastly price.
```python
bin_df = indie_df.loc[indie_df.owners_binned == "High"].dropna()
features = ["rating", "avg_playtime", "price", "languages_count"]
x_train, x_test, y_train, y_test = train_test_split(
    bin_df[features], bin_df["total_reviews"], test_size=0.2, random_state=10)
lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

# Distribution of actual against predicted review counts
plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model in the \"High\" bin")
plt.xlabel("Total Number of Reviews")
plt.text(x=27500, y=0.000175, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=27500, y=0.0001625, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=27500, y=0.00015, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=27500, y=0.0001375, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.show()

# Predicted against actual values; the red line marks perfect predictions
plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0,30000], [0,30000], color="red")
plt.title("Actual values against values from linear regression model in the \"High\" bin")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
```

According to the kdeplot, this linear regression model does not accurately represent the data, as the shape of the fitted distribution is not similar to the shape of the actual distribution.
However, according to the scatterplot, the linear regression model does follow the overall trend of the data points, thus it can still be used for analysis. The number of languages had the largest coefficient and contributes the most to the popularity of a game, followed by price, rating and lastly average playtime.
```python
bin_df = indie_df.loc[indie_df.owners_binned == "Low"].dropna()
features = ["rating", "avg_playtime", "price", "languages_count"]
x_train, x_test, y_train, y_test = train_test_split(
    bin_df[features], bin_df["total_reviews"], test_size=0.2, random_state=12)
lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

# Distribution of actual against predicted review counts
plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model in the \"Low\" bin")
plt.xlabel("Total Number of Reviews")
plt.text(x=3000, y=0.002, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=3000, y=0.001875, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=3000, y=0.00175, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=3000, y=0.001625, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.show()

# Predicted against actual values; the red line marks perfect predictions
plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0,4000], [0,4000], color="red")
plt.title("Actual values against values from linear regression model in the \"Low\" bin")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
```

According to the kdeplot, this linear regression model does not accurately represent the data, as the shape of the fitted distribution is not similar to the shape of the actual distribution.
However, according to the scatterplot, the linear regression model does follow the overall trend of the data points, thus it can still be used for analysis. The price had the largest coefficient and contributes the most to the popularity of a game, followed by number of languages, average playtime and lastly rating.
```python
bin_df = indie_df.loc[indie_df.owners_binned == "Lowest"].dropna()
features = ["rating", "avg_playtime", "price", "languages_count"]
x_train, x_test, y_train, y_test = train_test_split(
    bin_df[features], bin_df["total_reviews"], test_size=0.2, random_state=18)
lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

# Distribution of actual against predicted review counts
plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model in the \"Lowest\" bin")
plt.xlabel("Total Number of Reviews")
plt.text(x=5000, y=0.012, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=5000, y=0.0115, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=5000, y=0.011, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=5000, y=0.0105, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.show()

# Predicted against actual values; the red line marks perfect predictions
plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0,3500], [0,3500], color="red")
plt.title("Actual values against values from linear regression model in the \"Lowest\" bin")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
```

We can better see the accuracy of the linear regression model by adjusting the axes.
```python
bin_df = indie_df.loc[indie_df.owners_binned == "Lowest"].dropna()
features = ["rating", "avg_playtime", "price", "languages_count"]
x_train, x_test, y_train, y_test = train_test_split(
    bin_df[features], bin_df["total_reviews"], test_size=0.2, random_state=18)
lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

# Same kdeplot as before, zoomed in on the bulk of the data
plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model in the \"Lowest\" bin")
plt.xlabel("Total Number of Reviews")
plt.text(x=700, y=0.012, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=700, y=0.0115, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=700, y=0.011, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=700, y=0.0105, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.xlim([-100, 1000])
plt.show()

# Same scatterplot as before, zoomed in on the bulk of the data
plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0,900], [0,900], color="red")
plt.title("Actual values against values from linear regression model in the \"Lowest\" bin")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.xlim([-50, 1000])
plt.ylim([-50, 1000])
plt.show()
```

According to the kdeplot, this linear regression model does not accurately represent the data, as the shape of the fitted distribution is not similar to the shape of the actual distribution.
The scatterplot also shows that the predicted values from the linear regression model do not follow the trend of the actual values very well. We can therefore conclude that, in this bin, the 4 variables have a weaker correlation with the number of total reviews and a lesser effect on the success of a game, and that we cannot use this linear regression model for analysis.
### Q4: What are factors that indie game developers have to consider when developing an indie game? <a id="Q4_EDA"></a>

Firstly, we can analyse whether developers view their game as just a passion project or as an actual legitimate game. We can do this by plotting the proportion of itch.io games that are free and paid.
```python
# value_counts() ignores missing prices, so "Paid" is every priced game that is not free
price_counts = itchio_df.Price.value_counts()
free_or_paid = pd.Series({"Free": price_counts[0.00],
                          "Paid": price_counts.sum() - price_counts[0.00]})

plt.figure(figsize=(7, 7))
plt.pie(free_or_paid, labels=free_or_paid.index, autopct='%1.1f%%')
plt.title("Proportion of itch.io games that are free and paid")
plt.show()
```

The majority of the itch.io games are free, which suggests that indie game development is still widely seen as a hobby rather than an actual career path to make money.
We can also plot the proportion of itch.io games by length.
```python
plt.figure(figsize=(7, 7))
# Order the slices from shortest to longest session
labels = ["A few seconds", "A few minutes", "About a half-hour", "About an hour", "A few hours", "Days or more"]
plt.pie(itchio_df.loc[:, "Average session"].value_counts()[labels], labels=labels, autopct='%1.1f%%')
plt.title("Proportion of itch.io games by length")
plt.show()
```

Most of the itch.io games are very short, with a majority lasting only a few minutes. This suggests that most itch.io games are made as short passion projects, either to improve the developers' skills or as a hobby, rather than as long games with a lot of content.
Next, we can compare the most popular genres and tags of itch.io games to find out what types of itch.io games are being produced.
Firstly, we need to find the number of itch.io games that are in each genre.
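Counting labels stored as list-like strings can also be sketched with the standard library. This is a minimal illustration, assuming each cell holds a Python-style list literal such as `"['Adventure', 'Puzzle']"`; the helper name `count_labels` and the toy values are hypothetical. Parsing with `ast.literal_eval` keeps labels intact even when they contain commas or apostrophes.

```python
import ast
from collections import Counter


def count_labels(cells):
    """Count how many cells contain each label, skipping missing or malformed cells."""
    counts = Counter()
    for cell in cells:
        if not isinstance(cell, str):
            continue  # missing values such as NaN are not strings
        try:
            labels = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            continue  # not a valid Python literal
        if isinstance(labels, (list, tuple)):
            counts.update(labels)
    return counts


# Hypothetical cells in the assumed format, including one missing value.
toy_genres = [
    "['Adventure', 'Puzzle']",
    "['Puzzle']",
    float("nan"),
    "['Visual Novel', 'Adventure']",
]
genre_counts = count_labels(toy_genres)
print(genre_counts.most_common())
```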
```python
# Strip the quotes and brackets from each list-like string, split it into genres,
# then count how many games fall under each genre
genres_series = pd.Series(
    itchio_df.Genre.str.replace("'", "").str[1: -1].str.split(", ", expand=True).stack().values
).value_counts()
genres_series
```

We can now plot the number of itch.io games that are in each of the genres.
```python
genres_series.plot(kind="bar", figsize=(20, 10), xlabel="Genres", ylabel="Number of Games",
                   title="Number of itch.io games that are in each of the genres")
plt.show()
```

Many of the genres that were popular among Steam indie games were also popular among itch.io games, such as the "Adventure", "Puzzle", "Action" and "Platformer" genres. This confirms that these genres are truly popular to develop among all indie games, and not just Steam indie games or itch.io indie games.
What is worth noting is that interactive fiction games were popular among itch.io games, as seen by the "Visual Novel" and "Interactive Fiction" genres. This is similar to how interactive fiction games were also popular among Steam indie games in the "Low" bin. This could perhaps mean that indie games in itch.io are the most similar to Steam indie games in the "Low" bin.
We can also find the number of itch.io games that have each tag.
```python
# Same parsing as for the genres, applied to the Tags column
tags_series = pd.Series(
    itchio_df.Tags.str.replace("'", "").str[1: -1].str.split(", ", expand=True).stack().values
).value_counts()
tags_series
```

We can now plot the number of itch.io games that have each tag.
```python
tags_series[0:20].plot(kind="bar", figsize=(20, 10), xlabel="Tags", ylabel="Number of Games",
                       title="Number of itch.io games that have each tag")
plt.show()
```

Similar to the genres, many of the tags that were popular among Steam indie games were also popular among itch.io games, such as the "2D", "Pixel Art", "Singleplayer", "Short", "3D" and "Cute" tags. This confirms that these tags are truly popular to develop among all indie games, and not just Steam indie games or itch.io indie games.
Therefore, we can conclude that the patterns we observed in the types of indie games produced on Steam are also present in itch.io games. The decisions and solutions that indie game developers use to overcome limitations in resources and manpower on the commercial scale are thus also applicable to indie game development in general, regardless of whether the aim is to create a legitimate game or just a hobby or passion project.
We can analyse how useful external tools and software are in indie game development by plotting the proportion of itch.io games by number of tools and software used. These tools and software can help in the development of indie games in many different areas, such as programming, graphics and sound.
```python
plt.figure(figsize=(7, 7))
# Sort by number of tools so that the counts line up with the labels below
tool_counts = itchio_df.tools_count.value_counts().sort_index()
# Fold every game that used 5 or more tools into a single "5+" slice
tool_counts.iloc[5] = tool_counts.iloc[5:].sum()
plt.pie(tool_counts.iloc[0:6], labels=[0, 1, 2, 3, 4, "5+"], autopct='%1.1f%%')
plt.title("Proportion of itch.io games by number of tools and software used")
plt.show()
```

As the number of tools and software increases, the proportion of itch.io games decreases. However, the proportion of itch.io games that used 1 tool was almost equal to the proportion that used no tools, being only 0.1 percentage points lower. There is also a greater proportion of itch.io games that used at least 1 tool than of itch.io games that used no tools at all. This shows that external tools and software are useful for itch.io games and are used quite frequently in indie game development.
We can also analyse the most popular tools and software being used in indie game development by finding the number of itch.io games that use each tool.
# "Made with" holds stringified lists: strip the quotes and brackets, split on ", ", then count each tool
tools_series = pd.Series(
    itchio_df.loc[:, "Made with"].str.replace("'", "").str[1:-1]
    .str.split(", ", expand=True).stack().values
).value_counts()
tools_series

We can now plot the top 20 tools and software among itch.io games.
# value_counts() already sorts in descending order, so the first 20 rows are the top 20 tools
tools_series.iloc[0:20].plot(
    kind="bar", figsize=(20, 10),
    xlabel="Tools and software", ylabel="Number of games",
    title="Top 20 tools and software among itch.io games"
)
plt.show()

Unity was the most popular tool by far, with far more itch.io games than Bitsy, the second most popular tool. This makes sense, as Unity is widely considered to be one of the best game development tools for beginners because of how easy it is to use, which makes it a good fit for inexperienced indie game developers.
Out of the top 20 tools and software, 12 were game engines or programming tools, 6 were graphics tools for making art assets, and 2 were for creating audio and sound effects. The game engines and programming tools are Unity, Bitsy, RenPy, GameMaker: Studio, Twine, Construct, PICO-8, Godot, RPG Maker, Unreal Engine, OpenFL and PuzzleScript. The graphics and art tools are Adobe Photoshop, Aseprite, Blender, GIMP, Clip Studio Paint and Paint.net. Lastly, the audio and sound effect tools are Audacity and FL Studio.
Therefore, we can conclude that tools and software are very useful in many different areas of indie game development, such as programming, art and audio. This is especially true if the software is easy for indie game developers to use, as Unity is. As a result, external tools and software are an integral part of indie development, and many indie games use them to offset a lack of resources and manpower.
Finally, we can analyse how accessible itch.io games are by finding the proportions of itch.io games by the number of platforms, the number of languages, the number of inputs and the number of accessibility options.
plt.figure(figsize=(7, 7))
# Select the counts by label so that each slice matches its platform count
plt.pie(
    itchio_df.platforms_count.value_counts().loc[[0, 1, 2, 3, 4, 5]],
    labels=[0, 1, 2, 3, 4, 5],
    autopct='%1.1f%%'
)
plt.title("Proportions of itch.io games by the number of platforms")
plt.show()

While the proportion of itch.io games generally decreased as the number of platforms increased, the most common case was a game supporting 1 platform, rather than no platforms. There was also a greater proportion of itch.io games that supported 3 platforms than games that supported 2 platforms.
To find the reason for these trends, we can find the number of itch.io games that support each platform.
# "Platforms" holds stringified lists: strip the quotes and brackets, split on ", ", then count each platform
platforms_series = pd.Series(
    itchio_df.Platforms.str.replace("'", "").str[1:-1]
    .str.split(", ", expand=True).stack().values
).value_counts()
platforms_series

We can now plot the number of itch.io games that support each platform.
platforms_series.plot(
    kind="bar", figsize=(20, 10),
    xlabel="Platforms", ylabel="Number of Games",
    title="Number of itch.io games that support each platform"
)
plt.show()

The most popular platform was Windows, with HTML5, macOS and Linux also supported by a large number of itch.io games. This is likely because Windows, macOS and Linux are the most popular computer operating systems, while HTML5 is a widely used markup language for structuring and presenting content throughout the Internet, so HTML5 games can run directly in a web browser. This can also explain why more itch.io games supported 3 platforms than 2 platforms: games that supported 3 platforms likely supported the 3 most popular computer operating systems, Windows, macOS and Linux.
Therefore, we can conclude that other than the most popular computer operating systems and HTML5, which is already a popular markup language on the Internet, not many other platforms are supported by itch.io games.
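The per-platform counts above come from a column that stores each game's platforms as a stringified list. As an alternative sketch of the same counting step, the strings can be parsed with `ast.literal_eval` and flattened with `explode`. The toy frame is made up for illustration, and it assumes every entry is a valid Python list literal, which may not hold for the scraped data.

```python
import ast

import pandas as pd

# Hypothetical stand-in for itchio_df.Platforms (stringified Python lists)
toy = pd.DataFrame({
    "Platforms": [
        "['Windows', 'macOS', 'Linux']",
        "['HTML5']",
        "['Windows', 'HTML5']",
        "[]",
    ]
})

# Parse each string into a real list, then give every list element its own row
parsed = toy["Platforms"].apply(ast.literal_eval)
# Empty lists become NaN after explode(), so drop them before counting
platform_counts = parsed.explode().dropna().value_counts()
print(platform_counts)
```

Parsing the literal also copes with names that contain a comma or an apostrophe, which splitting on `", "` would break apart.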
plt.figure(figsize=(7, 7))
# Sort by count so that positions line up with the labels, then fold 3 or more into "3+"
languages_counts = itchio_df.languages_count.value_counts().sort_index()
languages_counts.iloc[3] = languages_counts.iloc[3:].sum()
plt.pie(
    languages_counts.iloc[0:4],
    labels=[0, 1, 2, "3+"],
    autopct='%1.1f%%'
)
plt.title("Proportions of itch.io games by the number of languages")
plt.show()

plt.figure(figsize=(7, 7))
inputs_counts = itchio_df.inputs_count.value_counts().sort_index()
inputs_counts.iloc[5] = inputs_counts.iloc[5:].sum()
plt.pie(
    inputs_counts.iloc[0:6],
    labels=[0, 1, 2, 3, 4, "5+"],
    autopct='%1.1f%%'
)
plt.title("Proportions of itch.io games by the number of inputs")
plt.show()

plt.figure(figsize=(7, 7))
accessibility_counts = itchio_df.accessibility_count.value_counts().sort_index()
accessibility_counts.iloc[3] = accessibility_counts.iloc[3:].sum()
plt.pie(
    accessibility_counts.iloc[0:4],
    labels=[0, 1, 2, "3+"],
    autopct='%1.1f%%'
)
plt.title("Proportions of itch.io games by the number of accessibility options")
plt.show()

The number of languages, the number of inputs and the number of accessibility options all showed the same trend: the proportion of indie games decreased as each of these numbers increased.
This can be due either to indie game developers not having enough resources or manpower to implement these options, or to developers not listing this extra information on the store page, which is where the data was scraped from. Regardless of the reason, we can conclude that, based on their store pages, itch.io indie games are not very accessible.
## Q1: How popular are indie games compared to AAA games? <a id="Q1"></a>

We can track the rise in popularity of indie and non-indie games by plotting the total concurrent players of each group against time. The total concurrent players in a month can be estimated as the sum of the average concurrent players of every game in that month. A rolling average is used to smooth the graph.
# Sum the monthly average concurrent players over all games in each group
indie = avg_players_df.loc[avg_players_df.is_indie].iloc[:, 3:].sum()
non_indie = avg_players_df.loc[~avg_players_df.is_indie].iloc[:, 3:].sum()
total_players_df = pd.DataFrame({"Indie": indie, "Non-indie": non_indie})
# Smooth each series with a 6-month rolling mean before plotting
pd.concat(
    [total_players_df[["Indie"]].rolling(6).mean().dropna(),
     total_players_df[["Non-indie"]].rolling(6).mean().dropna()],
    axis=1
).plot(
    kind="area", stacked=True, figsize=(20, 10),
    title="Total concurrent players from indie games and non-indie games against time",
    ylabel="Number of concurrent players", xlabel="Year"
)
plt.annotate(
    text="Non-indie games spike in 2018",
    xy=(datetime(2018, 3, 1), 3.8e6), xytext=(datetime(2018, 3, 1), 5e6),
    arrowprops={"arrowstyle": "->", "connectionstyle": "arc3", "lw": 3}
)
plt.show()

Both indie and non-indie games show a steady increase in the total number of concurrent players from 2013 to 2020. However, non-indie games had a spike in total concurrent players from late 2017 to early 2018, before returning to the normal rate of increase in late 2018. In 2020, the growth in total concurrent players accelerated for both indie and non-indie games, with a slight amount of oscillation.
However, in order to find the rise in popularity of indie games relative to non-indie games, we have to plot the proportion of concurrent players from indie and non-indie games against time, rather than the total number of concurrent players.
indie = avg_players_df.loc[avg_players_df.is_indie].iloc[:, 3:].sum()
non_indie = avg_players_df.loc[~avg_players_df.is_indie].iloc[:, 3:].sum()
total_players_df = pd.DataFrame({"Indie": indie, "Non-indie": non_indie})
# Convert the monthly totals into percentages of all concurrent players
total = total_players_df.sum(axis=1)
total_players_df.loc[:, "Indie"] = total_players_df.loc[:, "Indie"]/total*100
total_players_df.loc[:, "Non-indie"] = 100-total_players_df.loc[:, "Indie"]
pd.concat(
    [total_players_df[["Indie"]].rolling(6).mean().dropna(),
     total_players_df[["Non-indie"]].rolling(6).mean().dropna()],
    axis=1
).plot(
    kind="area", stacked=True, figsize=(20, 10),
    title="Proportion of concurrent players from indie games and non-indie games against time",
    ylabel="Proportion of concurrent players", xlabel="Year"
)
plt.show()

The proportion of concurrent players from indie games rose steadily from around 12% in 2013 to around 22% in 2022. Since this proportion increased over time, we can infer that indie games have grown at a faster rate than non-indie games. There was also a small dip in 2018, which is explained by the spike in total concurrent players that non-indie games had.
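The two steps behind this plot, converting monthly totals into percentage shares and then smoothing with a 6-month rolling mean, can be checked on a small made-up series. The numbers below are invented for illustration and are not taken from `avg_players_df`.

```python
import pandas as pd

# Hypothetical monthly totals of concurrent players for each group
months = pd.date_range("2020-01-01", periods=8, freq="MS")
players = pd.DataFrame(
    {
        "Indie": [10, 12, 14, 16, 18, 20, 22, 24],
        "Non-indie": [90, 88, 86, 84, 82, 80, 78, 76],
    },
    index=months,
)

# Share of each group: divide every row by that month's total
share = players.div(players.sum(axis=1), axis=0) * 100

# 6-month rolling mean; the first 5 months have no full window and are dropped
smoothed = share.rolling(6).mean().dropna()
print(smoothed)
```

Because the rolling mean needs a full window, the smoothed series starts 5 months after the raw one, which is why the plotted curves begin slightly later than the data.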
We can also plot the proportion of concurrent players from indie and non-indie games against time for games with different popularity levels.
# Repeat the proportion plot separately for each popularity bin
for grp in ["Highest", "High", "Low", "Lowest"]:
    indie = avg_players_df.loc[avg_players_df.is_indie & (avg_players_df.owners_binned == grp)].iloc[:, 3:].sum()
    non_indie = avg_players_df.loc[~avg_players_df.is_indie & (avg_players_df.owners_binned == grp)].iloc[:, 3:].sum()
    total_players_df = pd.DataFrame({"Indie": indie, "Non-indie": non_indie})
    total = total_players_df.sum(axis=1)
    total_players_df.loc[:, "Indie"] = total_players_df.loc[:, "Indie"]/total*100
    total_players_df.loc[:, "Non-indie"] = 100-total_players_df.loc[:, "Indie"]
    pd.concat(
        [total_players_df[["Indie"]].rolling(6).mean().dropna(),
         total_players_df[["Non-indie"]].rolling(6).mean().dropna()],
        axis=1
    ).plot(
        kind="area", stacked=True, figsize=(20, 10),
        title="Proportion of concurrent players from indie games and non-indie games against time in the \""+grp+"\" bin",
        ylabel="Proportion of concurrent players", xlabel="Year"
    )
    plt.show()

The proportion of concurrent players from indie games increased over time, regardless of the popularity level of the games. However, the increase was greater among less popular games.
Therefore, we can infer that the popularity of indie games among players has been on the rise and is catching up to the popularity of non-indie games, especially for less popular games.
Next, we can plot the total number of indie and non-indie games released against time.
# Count the releases per date for each group, then accumulate them into running totals
date_freq = steam_df.groupby(["is_indie", "date"])[["id"]].count().reset_index()
date_freq = pd.pivot_table(date_freq, values="id", index=["date"], columns="is_indie").fillna(0)
date_freq.loc[:, True] = date_freq.loc[:, True].cumsum()
date_freq.loc[:, False] = date_freq.loc[:, False].cumsum()
date_freq.iloc[:, [1, 0]].plot(
    kind="area", figsize=(20, 10), stacked=True,
    title="Total number of indie and non-indie games released against time",
    ylabel="Number of games", xlabel="Year",
    xlim=[datetime(2000, 1, 1), datetime(2022, 12, 31)]
)
plt.show()

The total numbers of both indie and non-indie games increased over time. The total number of indie games grew exponentially from 2008 onwards, surpassing the total number of non-indie games in 2015. This exponential growth can be better visualised if we instead plot the proportion of indie and non-indie games released against time.
# Convert the running totals into percentages of all games released so far
date_freq.loc[:, True] = date_freq.loc[:, True]/(date_freq.loc[:, True]+date_freq.loc[:, False])*100
date_freq.loc[:, False] = 100-date_freq.loc[:, True]
# The column order is already set by iloc; sort_columns was removed in pandas 2.0, so it is not passed
date_freq.iloc[:, [1, 0]].plot(
    kind="area", figsize=(20, 10), stacked=True,
    title="Proportion of indie and non-indie games released against time",
    ylabel="Proportion of games", xlabel="Year",
    xlim=[datetime(2000, 1, 1), datetime(2022, 12, 31)]
)
plt.annotate(
    text="Start of exponential growth at 2008",
    xy=(datetime(2008, 1, 1), 15), xytext=(datetime(2008, 1, 1), 40),
    arrowprops={"arrowstyle": "->", "connectionstyle": "arc3", "lw": 3}
)
plt.show()

The proportion of indie games released increased from less than 10% in 2000 to around 75% in 2022. Here, we can clearly see the exponential growth from 2008 onwards, with the proportion of indie games reaching 50% in 2015. Given the exponential growth in the number of indie games released from 2008 onwards, and how much faster indie games have grown relative to non-indie games, we can conclude that the prevalence of indie games, and likely the demand for them, started to increase rapidly from 2008 onwards.
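The running totals and the indie share of releases can be reproduced on a handful of made-up rows. In this sketch the `kind` column and the toy dates are assumptions introduced for illustration; the notebook itself pivots on the boolean `is_indie` column of `steam_df`.

```python
import pandas as pd

# Hypothetical release log: one row per game
releases = pd.DataFrame({
    "date": pd.to_datetime([
        "2007-01-01", "2008-01-01", "2008-01-01",
        "2009-01-01", "2009-01-01", "2009-01-01",
    ]),
    "is_indie": [False, False, True, True, True, False],
})
# String labels avoid indexing columns by True/False
releases["kind"] = releases["is_indie"].map({True: "Indie", False: "Non-indie"})

# Releases per date and group, with 0 where a group had no release on that date
per_date = releases.groupby(["date", "kind"]).size().unstack(fill_value=0)

# Running total of games released so far, then the indie share of that total
cumulative = per_date.cumsum()
indie_share = cumulative["Indie"] / cumulative.sum(axis=1) * 100
print(indie_share)
```

Note that the share is computed on cumulative totals, so it describes the catalogue of games released so far rather than the releases of a single year.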
However, there is another way to compare the popularity of indie and non-indie games. By plotting boxplots of the total number of reviews of indie and non-indie games at different levels of popularity, we can infer whether indie games are comparable in size and popularity to non-indie games. We have to separate the different levels of popularity into different boxplots because their y-axis scales differ greatly.
# One boxplot per popularity bin, with outliers hidden so that the boxes stay readable
for grp in ["Highest", "High", "Low", "Lowest"]:
    plt.figure(figsize=(20, 10))
    bin_df = steam_df.loc[steam_df.owners_binned == grp]
    sns.boxplot(data=bin_df, y="total_reviews", x="is_indie", order=[True, False], showfliers=False)
    plt.ylabel("Total number of reviews")
    plt.title("Distribution of the total number of reviews of indie games and non-indie games in the \""+grp+"\" bin")
    plt.show()

In all 4 bins, the median of indie games is higher than the median of non-indie games. Both indie and non-indie games have distributions that are skewed to the right in all 4 bins. The IQR of indie games was larger than that of non-indie games in the "Highest" and "Low" bins, and smaller in the "High" and "Lowest" bins.
The median total number of reviews of indie games is consistently higher than that of non-indie games, so we can infer that indie games are still comparable in scale and popularity to non-indie games, regardless of the level of popularity.
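The medians and IQRs read off the boxplots can also be computed directly. The following is a minimal sketch on invented review counts; the `kind` column is a hypothetical relabelling of `is_indie`, and the numbers are not from `steam_df`.

```python
import pandas as pd

# Hypothetical review counts for a single popularity bin
reviews = pd.DataFrame({
    "kind": ["Indie"] * 5 + ["Non-indie"] * 5,
    "total_reviews": [10, 20, 30, 40, 500, 5, 10, 15, 20, 400],
})

# Quartiles per group: rows are groups, columns are the requested quantiles
q = reviews.groupby("kind")["total_reviews"].quantile([0.25, 0.5, 0.75]).unstack()

# The median and interquartile range are the quantities a boxplot draws as the line and the box
summary = pd.DataFrame({"median": q[0.5], "iqr": q[0.75] - q[0.25]})
print(summary)
```

Both toy groups contain one very large value, yet the median and IQR barely react to it, which is the same reason the boxplots hide outliers with `showfliers=False`.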
To summarise, the popularity of indie games among players has been on the rise and is catching up to the popularity of non-indie games, with more and more players starting to play indie games as time goes on. This is especially true for less popular games, where the rise in popularity is more rapid. Alongside this, the number of indie games released has risen exponentially, with many more indie games being released now than in the past. This trend started at around 2008, when the number of indie games released started to grow exponentially. While non-indie games might still have more players than indie games, indie games are still comparable in scale and popularity to non-indie games, regardless of the level of popularity.
## Q2: What are the major differences between indie games and AAA games? <a id="Q2"></a>

Firstly, we can compare the quality of indie and non-indie games by plotting the distribution of the ratings of indie and non-indie games.
plt.figure(figsize=(20, 10))
sns.boxplot(data=steam_df, y="rating", x="is_indie", order=[True, False])
plt.title("Distribution of rating of indie and non-indie games")
plt.ylabel("Rating")
plt.show()

Indie games have a higher median than non-indie games. Both indie and non-indie games have distributions that are skewed to the left. Indie games have a smaller IQR than non-indie games. Both indie and non-indie games have outliers below the lower bound.
This pattern is also consistent at different popularity levels.
plt.figure(figsize=(20, 10))
# Group the boxes by popularity bin and split each bin into indie and non-indie
sns.boxplot(data=steam_df, y="rating", x="owners_binned", order=["Highest", "High", "Low", "Lowest"], hue="is_indie", hue_order=[True, False])
plt.title("Distribution of rating of indie and non-indie games at different popularity levels")
plt.ylabel("Rating")
plt.show()

In all 4 bins, indie games have a higher median than non-indie games, both indie and non-indie games have distributions that are skewed to the left, indie games have a smaller IQR than non-indie games, and both indie and non-indie games have outliers below the lower bound.
It is also worth noting that the difference in rating medians between indie and non-indie games is greatest in the "Highest" bin. This suggests that the difference in quality between indie and non-indie games is largest among the most popular games, where indie games are much better received than non-indie games of around the same popularity.
Regardless of the level of popularity, indie games are overall more enjoyable and more positively received than non-indie games, as seen by the higher median of indie games. There is also less variation in quality in indie games than in non-indie games, shown by the smaller IQR of indie games.
We can plot the proportions of indie and non-indie games that are in each of the top 11 genres to see what types of indie and non-indie games are being developed.
# Percentage of indie (True) and non-indie (False) games in each genre.
# "genres" holds stringified lists; the group sizes come from the is_indie mask itself.
genres_df = pd.DataFrame({
    True: pd.Series(
        steam_df.loc[steam_df.is_indie].genres.str.replace("'", "").str[1:-1]
        .str.split(", ", expand=True).stack().values
    ).value_counts()/steam_df.is_indie.sum(),
    False: pd.Series(
        steam_df.loc[~steam_df.is_indie].genres.str.replace("'", "").str[1:-1]
        .str.split(", ", expand=True).stack().values
    ).value_counts()/(~steam_df.is_indie).sum()
}).drop("Indie")*100
# Keep the 11 genres that are most common among indie games
genres_df = genres_df.sort_values(by=True, ascending=False)
genres_df = genres_df[0:11]
genres_df.plot(
    kind="bar", figsize=(20, 10),
    xlabel="Genre", ylabel="Proportion of Games",
    title="Proportions of indie and non-indie games in each of the top 11 genres"
)
plt.legend().set_title("is_indie")
plt.show()

The "Action", "Casual" and "Adventure" genres were the top 3 genres for both indie and non-indie games. However, a higher proportion of indie games than non-indie games are in these top 3 genres. Other than the top 3 genres, there is also a higher proportion of indie games in the "RPG" and "Early Access" genres, whereas the "Strategy", "Simulation", "Free to Play", "Sports", "Racing" and "Massively Multiplayer" genres have a higher proportion of non-indie games.
This graph shows us that indie games are not as diverse in their genres as non-indie games, as a higher proportion of indie games fall into the top 3 genres instead of being spread more evenly. This can be due to limitations that indie games face but non-indie games do not, which restrict the genres of games that indie game developers can produce. For example, games in the "Simulation", "Sports" and "Racing" genres might require a level of realism in graphics and gameplay that demands more resources and manpower than indie developers have. Games in the "Strategy" genre might require more complicated and in-depth game mechanics to keep players hooked, while games in the "Massively Multiplayer" genre would require running servers to support multiplayer, both of which might be difficult for an individual to implement without prior knowledge and resources.
Therefore, there are some types of games that would be more difficult to produce and implement, which indie games tend to stay away from due to limitations in resources and manpower.
However, indie game developers are able to deal with these limitations in resources and manpower in ingenious ways. We can see how by finding the top 20 tags with the greatest ratio of indie games to non-indie games, and vice versa.
```python
def tag_share(games):
    """Fraction of the given games that carry each tag."""
    # The tags column holds a stringified list, e.g. "['Short', 'Puzzle']".
    tags = games.tags.str.replace("'", "").str[1:-1].str.split(", ", expand=True).stack()
    return pd.Series(tags.values).value_counts() / len(games)

# Share of indie (True) and non-indie (False) games per tag, in percent,
# excluding genre names and the "Indie" tag itself.
tags_df = pd.DataFrame({
    True: tag_share(steam_df.loc[steam_df.is_indie]),
    False: tag_share(steam_df.loc[~steam_df.is_indie]),
}).drop(list(genres_df.index) + ["Indie"]) * 100

# Keep tags that fall in the top quartile of either group.
q75 = tags_df.quantile(0.75)
tags_top = tags_df.loc[(tags_df.iloc[:, 0] >= q75.iloc[0]) | (tags_df.iloc[:, 1] >= q75.iloc[1])]

(tags_top.iloc[:, 0] / tags_top.iloc[:, 1]).sort_values(ascending=False)[0:20].plot(
    kind="bar", figsize=(20, 10), xlabel="Tags", ylabel="Ratio",
    title="Top 20 tags with the greatest ratio of indie games to non-indie games")
plt.show()

(tags_top.iloc[:, 1] / tags_top.iloc[:, 0]).sort_values(ascending=False)[0:20].plot(
    kind="bar", figsize=(20, 10), color="orange", xlabel="Tags", ylabel="Ratio",
    title="Top 20 tags with the greatest ratio of non-indie games to indie games")
plt.show()
```

Now, we can finally see some patterns in the types of games being developed as indie and non-indie games.
For indie games, "Short" is the top tag by quite a margin. This makes sense, as indie game developers usually do not have the resources or manpower to create extremely long games with a lot of content. However, some of the other tags give us an idea of how indie game developers work around these constraints. For example, some indie games use procedural generation, an algorithmic process for generating game content. This allows the gameplay to feel fresh and varied without every piece of content being designed by hand, increasing the replay value of indie games. As it turns out, "Procedural Generation" and "Replay Value" are both among the top 20 tags for indie games. Roguelikes and roguelites are examples of games that use procedural generation, and "Roguelike" and "Roguelite" both appear among the tags as well. Some indie games also make gameplay more engaging by making it more difficult or fast-paced and requiring time to master, which can explain the tags "Difficult" and "Fast-Paced". Such games include the "Bullet Hell", "Top-Down Shooter" and "Shoot Em Up" types, which also appear as tags. Finally, platformers and puzzle games are quite popular among indie games, with the tags "Puzzle Platformer", "Platformer", "Logic" and "Puzzle" all appearing in the top 20. This may be because puzzle games and platformers usually have simpler gameplay than other types of games.

On the other hand, for non-indie games, "Classic" is the top tag by quite a margin. This may be because many non-indie games are seen as classics or feature recognisable characters. There are also some types of games that are more complicated and require more resources and manpower. There are the "Historical", "Military", "War" and "Driving" tags, where games have to be as realistic as possible, making them complex to develop. There are the "RTS", "JRPG", "Turn-Based Strategy" and "Tactical" tags, which need in-depth strategy and enough balancing to create interesting gameplay. There are also games with the "Open World" tag, which are usually very large in scale, as the world has to be large enough to incentivise players to explore it. Lastly, there are the "Multiplayer", "Online Co-Op", "VR", "PvP" and "Co-op" tags, which require external infrastructure or hardware, such as servers and VR headsets, in order to run.

Therefore, these graphs and tags show that, to combat the lack of resources and manpower, some patterns emerge among indie games, such as making gameplay more unique or interesting and sticking to types of games that are easier to develop.
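To make the ratio behind these charts concrete, here is a minimal sketch on a made-up five-game table. The frame `toy_df` and the helper `tag_share` are hypothetical stand-ins for `steam_df` and the tag-parsing step; they only assume that tags are stored as a stringified list.

```python
import pandas as pd

# Hypothetical stand-in for steam_df: tags are stored as a stringified list.
toy_df = pd.DataFrame({
    "is_indie": [True, True, True, False, False],
    "tags": [
        "['Short', 'Puzzle']",
        "['Short', 'Roguelike']",
        "['Puzzle', 'Open World']",
        "['Open World', 'Classic']",
        "['Classic', 'Short']",
    ],
})

def tag_share(frame):
    """Percentage of games in `frame` that carry each tag."""
    tags = (
        frame["tags"]
        .str.replace("'", "", regex=False)  # drop the quote characters
        .str[1:-1]                          # drop the surrounding brackets
        .str.split(", ")
        .explode()                          # one row per (game, tag) pair
    )
    return tags.value_counts() / len(frame) * 100

shares = pd.DataFrame({
    "indie": tag_share(toy_df[toy_df["is_indie"]]),
    "non_indie": tag_share(toy_df[~toy_df["is_indie"]]),
})

# Ratio of indie share to non-indie share; tags missing from one group give NaN.
ratio = (shares["indie"] / shares["non_indie"]).sort_values(ascending=False)
print(ratio)
```

In this toy table "Short" appears in two of three indie games and one of two non-indie games, so its ratio is above 1, while tags that occur in only one group come out as NaN and sort last.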
To summarise, indie game developers face many restrictions and limitations when developing their games due to a lack of resources and manpower. One such restriction is not being able to produce types of games that are more difficult to build, whether because they require hyperrealistic graphics and gameplay, need complex and in-depth game mechanics to create interesting gameplay and keep the player hooked, are too large in scale, or require external infrastructure or hardware, such as servers and VR headsets, to run. However, indie game developers are able to deal with these limitations in ingenious ways. These include making gameplay more unique and interesting, either through procedural generation or through difficult or fast-paced gameplay that requires time to master, or making games that are simpler to develop. As a result, even with the difference in resources and manpower, indie games can still be of the same quality as non-indie games, perhaps even of a higher quality. Indie games can be just as enjoyable and positively received as non-indie games.
## Q3: What factors contribute to the success of an indie game? <a id="Q3"></a>

If a game is of a higher quality and is made more accessible for players, its developers would provide players with more language options. Therefore, we can compare the distribution of the number of languages of indie games of different popularities to see how accessible indie games of different popularities are, as well as to give a general gauge of their quality.
```python
plt.figure(figsize=(20, 10))
# One density curve per popularity bin.
for b in ["Highest", "High", "Low", "Lowest"]:
    sns.kdeplot(data=indie_df.loc[indie_df.owners_binned == b], x="languages_count", label=b)
plt.title("Distribution of the number of languages of indie games of different popularities")
plt.xlabel("Number of Languages")
plt.legend()
plt.show()
```

As popularity increases, the density of indie games with fewer than 5 languages decreases, while the density of indie games with 5 or more languages increases. While "High", "Low" and "Lowest" had similar shapes, with a peak at 1 language before decreasing in density as the number of languages increases, "Highest" had a completely different shape, being much more spread out with a maximum density at around 10 languages.

Therefore, we can conclude that overall, as popularity increases, the number of languages increases. This suggests that indie games that are more popular tend to be of a higher quality and more accessible for players.
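The shapes of the density curves can also be backed up with a couple of numbers per bin. The sketch below is only an illustration on invented values: `toy_indie` is a hypothetical stand-in for `indie_df`, and the choice of the median and the share of games with 5 or more languages as summary statistics is an assumption, not something computed in the original analysis.

```python
import pandas as pd

# Hypothetical stand-in for indie_df with made-up values.
toy_indie = pd.DataFrame({
    "owners_binned": ["Lowest", "Lowest", "Low", "Low", "High", "High", "Highest", "Highest"],
    "languages_count": [1, 1, 1, 2, 1, 4, 8, 12],
})

order = ["Lowest", "Low", "High", "Highest"]
summary = (
    toy_indie.groupby("owners_binned")["languages_count"]
    .agg(
        median="median",
        share_5_plus=lambda s: (s >= 5).mean(),  # fraction of games with 5+ languages
    )
    .reindex(order)  # order the bins by popularity rather than alphabetically
)
print(summary)
```

On the real data, a rising median and a rising `share_5_plus` from "Lowest" to "Highest" would say the same thing as the KDE plot, in a form that is easier to quote.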
We can examine how the rating, average playtime, price and number of languages of a game affect its popularity. To do this, we compute the correlation coefficient that each of these variables has with the total number of reviews at different levels of popularity.
```python
columns = ["rating", "avg_playtime", "price", "languages_count"]
bins = ["Highest", "High", "Low", "Lowest"]
pearson_coef_dict = {}
for col in columns:
    pearson_coef_row = []
    for b in bins:
        # Rows with missing values are dropped before computing the correlation.
        bin_df = indie_df.loc[indie_df.owners_binned == b].dropna()
        pearson_coef, p_value = stats.pearsonr(bin_df[col], bin_df["total_reviews"])
        pearson_coef_row.append(pearson_coef)
    pearson_coef_dict[col] = pearson_coef_row

pd.DataFrame(pearson_coef_dict, index=bins).plot(
    kind="line", figsize=(20, 10), xlabel="Bins", ylabel="Correlation Coefficient",
    title="Correlation coefficient that rating, average playtime, price, number of languages have with the total number of reviews at different levels of popularity")
plt.legend(["Rating", "Average Playtime", "Price", "Number of Languages"])
plt.show()
```

In the "Lowest" bin, the correlation coefficients of all 4 variables were low, showing that these variables might not correlate as strongly with the number of reviews for the least popular games.

From the "Highest" bin to the "Low" bin, average playtime had a decreasing trend, price had an increasing trend, and rating and number of languages did not have any obvious trend. Average playtime had a stronger correlation with the total number of reviews, and so appears to contribute more to popularity, for games that are more popular. This is possibly because, if every game is high in quality, a successful game has to have more content and be larger in scale to stand out against its competition. On the other hand, price had a stronger correlation with the total number of reviews, and so appears to contribute more to popularity, for games that are less popular. This is possibly because, if a game is lacking in content and quality, as a lot of less popular games are, the price would be the biggest factor in deciding which games players purchase and which games succeed, whereas if a game has more content and is higher in quality, as a lot of more popular games are, any reasonable price would be acceptable. Finally, rating and the number of languages had a roughly constant correlation with the total number of reviews, so the popularity of a game does not seem to affect how much they contribute to it. This may be because the quality and accessibility of a game are useful factors in creating a successful game, regardless of how popular the game is.

We can confirm this by training linear regression models and seeing whether they fit the data. The models are trained in groups of 20, with random_state ranging from 0 to 19, and the best models are picked by hand.
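As a rough sketch of what such a sweep over `random_state` could look like, the example below fits one model per seed on synthetic data and records the test-set R². The synthetic `X` and `y`, and the use of R² as a selection score in place of visual inspection, are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the four features and total_reviews.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, 1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

scores = {}
for seed in range(20):  # random_state from 0 to 19
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    lm = LinearRegression().fit(x_train, y_train)
    scores[seed] = r2_score(y_test, lm.predict(x_test))

best_seed = max(scores, key=scores.get)
print("best seed:", best_seed, "R^2:", round(scores[best_seed], 3))
print("mean R^2 over all seeds:", round(float(np.mean(list(scores.values()))), 3))
```

Reporting the mean alongside the best seed shows how much of the apparent fit comes from the particular train/test split rather than from the model itself.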
```python
features = ["rating", "avg_playtime", "price", "languages_count"]
data = indie_df.dropna()
x_train, x_test, y_train, y_test = train_test_split(
    data[features], data["total_reviews"], test_size=0.2, random_state=7)

lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

# Distribution of actual against predicted review counts.
plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model")
plt.xlabel("Total Number of Reviews")
plt.text(x=150000, y=0.0003, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=150000, y=0.000275, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=150000, y=0.00025, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=150000, y=0.000225, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.show()

# Predicted against actual values; the red line marks a perfect prediction.
plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0, 500000], [0, 500000], color="red")
plt.title("Actual values against values from linear regression model")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
```

We can better see the accuracy of the linear regression model by adjusting the axes.
```python
features = ["rating", "avg_playtime", "price", "languages_count"]
data = indie_df.dropna()
x_train, x_test, y_train, y_test = train_test_split(
    data[features], data["total_reviews"], test_size=0.2, random_state=7)

lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

# Same plots as above, zoomed in on the range where most games lie.
plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model")
plt.xlabel("Total Number of Reviews")
plt.xlim([-10000, 100000])
plt.text(x=70000, y=0.0003, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=70000, y=0.000275, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=70000, y=0.00025, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=70000, y=0.000225, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.show()

plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0, 45000], [0, 45000], color="red")
plt.xlim([-1000, 50000])
plt.ylim([-1000, 50000])
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
```

According to the kdeplot, this linear regression model represents the data somewhat accurately, as the shape of the fitted distribution is similar to the shape of the actual distribution. However, the model becomes less accurate for games with a higher number of total reviews, as seen in both the kdeplot and the scatterplot. The number of languages had the largest coefficient, suggesting that it contributes the most to the popularity of a game, followed by average playtime, price and lastly rating.
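Raw regression coefficients depend on the units of each feature (for example, playtime in minutes against a rating in percent), so their sizes are easier to compare once the features are put on a common scale. The sketch below illustrates this on synthetic data only; the frame `toy`, its value ranges, and the use of `StandardScaler` are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales (all values invented).
rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({
    "rating": rng.uniform(0, 100, n),
    "avg_playtime": rng.uniform(0, 5000, n),
    "price": rng.uniform(0, 30, n),
    "languages_count": rng.integers(1, 20, n),
})
toy["total_reviews"] = (
    5 * toy["rating"]
    + 0.5 * toy["avg_playtime"]
    + 10 * toy["languages_count"]
    + rng.normal(scale=50, size=n)
)

features = ["rating", "avg_playtime", "price", "languages_count"]

# Fit once on the raw features and once on z-scored features.
raw = LinearRegression().fit(toy[features], toy["total_reviews"])
scaled = LinearRegression().fit(
    StandardScaler().fit_transform(toy[features]), toy["total_reviews"]
)

comparison = pd.DataFrame(
    {"raw": raw.coef_, "standardized": scaled.coef_}, index=features
)
print(comparison)
```

In this toy setup the number of languages has the largest raw coefficient, yet average playtime has the largest standardized one because it varies over a much wider range. A similar check on `indie_df` would show whether the ranking of factors holds once units are taken out of the picture.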
```python
features = ["rating", "avg_playtime", "price", "languages_count"]
data = indie_df.loc[indie_df.owners_binned == "Highest"].dropna()
x_train, x_test, y_train, y_test = train_test_split(
    data[features], data["total_reviews"], test_size=0.2, random_state=4)

lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model in the \"Highest\" bin")
plt.xlabel("Total Number of Reviews")
plt.text(x=450000, y=7e-6, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=450000, y=6.5e-6, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=450000, y=6e-6, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=450000, y=5.5e-6, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.show()

plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0, 500000], [0, 500000], color="red")
plt.title("Actual values against values from linear regression model in the \"Highest\" bin")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
```

According to the kdeplot, this linear regression model represents the data somewhat accurately, as the shape of the fitted distribution is similar to the shape of the actual distribution. Interestingly, the coefficient of price is negative, while the other coefficients are positive. This makes sense, as the lower the price, the more popular and successful the game would be. Average playtime had the coefficient with the largest magnitude, suggesting that it contributes the most to the popularity of a game, followed by number of languages, rating, and lastly price.
```python
features = ["rating", "avg_playtime", "price", "languages_count"]
data = indie_df.loc[indie_df.owners_binned == "High"].dropna()
x_train, x_test, y_train, y_test = train_test_split(
    data[features], data["total_reviews"], test_size=0.2, random_state=10)

lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model in the \"High\" bin")
plt.xlabel("Total Number of Reviews")
plt.text(x=27500, y=0.000175, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=27500, y=0.0001625, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=27500, y=0.00015, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=27500, y=0.0001375, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.show()

plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0, 30000], [0, 30000], color="red")
plt.title("Actual values against values from linear regression model in the \"High\" bin")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
```

According to the kdeplot, this linear regression model does not accurately represent the data, as the shape of the fitted distribution is not similar to the shape of the actual distribution. However, according to the scatterplot, the model does follow the overall trend of the data points, so it can still be used for analysis. The number of languages had the largest coefficient, suggesting that it contributes the most to the popularity of a game, followed by price, rating and lastly average playtime.
```python
features = ["rating", "avg_playtime", "price", "languages_count"]
data = indie_df.loc[indie_df.owners_binned == "Low"].dropna()
x_train, x_test, y_train, y_test = train_test_split(
    data[features], data["total_reviews"], test_size=0.2, random_state=12)

lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model in the \"Low\" bin")
plt.xlabel("Total Number of Reviews")
plt.text(x=3000, y=0.002, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=3000, y=0.001875, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=3000, y=0.00175, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=3000, y=0.001625, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.show()

plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0, 4000], [0, 4000], color="red")
plt.title("Actual values against values from linear regression model in the \"Low\" bin")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
```

According to the kdeplot, this linear regression model does not accurately represent the data, as the shape of the fitted distribution is not similar to the shape of the actual distribution. However, according to the scatterplot, the model does follow the overall trend of the data points, so it can still be used for analysis. Price had the largest coefficient, suggesting that it contributes the most to the popularity of a game, followed by number of languages, average playtime and lastly rating.
```python
features = ["rating", "avg_playtime", "price", "languages_count"]
data = indie_df.loc[indie_df.owners_binned == "Lowest"].dropna()
x_train, x_test, y_train, y_test = train_test_split(
    data[features], data["total_reviews"], test_size=0.2, random_state=18)

lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model in the \"Lowest\" bin")
plt.xlabel("Total Number of Reviews")
plt.text(x=5000, y=0.012, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=5000, y=0.0115, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=5000, y=0.011, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=5000, y=0.0105, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.show()

plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0, 3500], [0, 3500], color="red")
plt.title("Actual values against values from linear regression model in the \"Lowest\" bin")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
```

We can better see the accuracy of the linear regression model by adjusting the axes.
```python
features = ["rating", "avg_playtime", "price", "languages_count"]
data = indie_df.loc[indie_df.owners_binned == "Lowest"].dropna()
x_train, x_test, y_train, y_test = train_test_split(
    data[features], data["total_reviews"], test_size=0.2, random_state=18)

lm = LinearRegression()
lm.fit(x_train, y_train)
yhat = lm.predict(x_test)

# Same plots as above, zoomed in on the range where most games lie.
plt.figure(figsize=(20, 10))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual distribution against distribution from linear regression model in the \"Lowest\" bin")
plt.xlabel("Total Number of Reviews")
plt.text(x=700, y=0.012, s=f"Coefficient of rating: {lm.coef_[0]}")
plt.text(x=700, y=0.0115, s=f"Coefficient of average playtime: {lm.coef_[1]}")
plt.text(x=700, y=0.011, s=f"Coefficient of price: {lm.coef_[2]}")
plt.text(x=700, y=0.0105, s=f"Coefficient of number of languages: {lm.coef_[3]}")
plt.xlim([-100, 1000])
plt.show()

plt.figure(figsize=(20, 10))
plt.scatter(y_test, yhat)
plt.plot([0, 900], [0, 900], color="red")
plt.title("Actual values against values from linear regression model in the \"Lowest\" bin")
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.xlim([-50, 1000])
plt.ylim([-50, 1000])
plt.show()
```

According to the kdeplot, this linear regression model does not accurately represent the data, as the shape of the fitted distribution is not similar to the shape of the actual distribution. The scatterplot also shows that the predicted values do not follow the trend of the actual values very well. Therefore, we can conclude that, for the least popular games, the 4 variables have a weaker correlation with the number of total reviews and a lesser effect on the success of a game, and that we cannot use this linear regression model for analysis.
We can see that the linear regression plots do in fact follow the trends of the correlation coefficients.

The "Lowest" bin shows little to no correlation for all 4 variables. Average playtime had a decreasing trend in how it affects the success of the game as popularity decreases, having the highest coefficient of the 4 variables in the "Highest" bin, before having the steepest steady decrease in relative effect on success. Price had an increasing trend in how it affects the success of the game as popularity decreases, having the highest coefficient of the 4 variables in the "Low" bin, after having the steepest steady increase in relative effect on success. Rating and the number of languages did not have any obvious trend in how they affect the success of the game as popularity changes, both having around the same relative effect on success in the other 3 bins.
To summarise, the main factor in the success of an indie game is the quality of the game: in order to be one of the more popular games, an indie game has to be as polished and high-quality as possible. This is shown by how the number of languages and the rating of a game, two general gauges of an indie game's quality, have a similar amount of effect on the success of a game regardless of how popular the game is, suggesting that the quality of an indie game contributes consistently to its success. Other factors include the length of the game and how much content it has, which contributes the most to the success of very popular games, as well as the price of the game, which contributes the most to the success of less popular games.
## Q4: What are factors that indie game developers have to consider when developing an indie game? <a id="Q4"></a>

Firstly, we can analyse whether developers treat their game as a passion project or as a legitimate commercial game. We can do this by plotting the proportion of itch.io games that are free versus paid.
```python
# Count games at each price; games priced at 0.00 are free, all others are paid
price_counts = itchio_df.Price.value_counts()
free_or_paid = pd.Series({
    "Free": price_counts[0.00],
    "Paid": price_counts.sum() - price_counts[0.00],
})
plt.figure(figsize=(7, 7))
plt.pie(free_or_paid, labels=free_or_paid.index, autopct='%1.1f%%')
plt.title("Proportion of itch.io games that are free and paid")
plt.show()
```

The majority of itch.io games are free, which suggests that indie game development on itch.io is still widely treated as a hobby rather than as a career path to make money.

We can also plot the proportion of itch.io games by length.
```python
plt.figure(figsize=(7, 7))
# Session lengths in ascending order, so the pie slices follow game length
labels = ["A few seconds", "A few minutes", "About a half-hour", "About an hour", "A few hours", "Days or more"]
plt.pie(
    itchio_df.loc[:, "Average session"].value_counts()[labels],
    labels=labels,
    autopct='%1.1f%%',
)
plt.title("Proportion of itch.io games by length")
plt.show()
```

Most itch.io games are very short, with a majority lasting only a few minutes. This suggests that most itch.io games are made as short passion projects, either to improve the developers' skills or as a hobby, rather than as long games with a lot of content.
Next, we can compare the most popular genres and tags of itch.io games to find out what types of itch.io games are being produced.

Firstly, we need to find the number of itch.io games in each genre.
```python
# Strip the quotes and surrounding brackets from each Genre string, split it
# into individual genres, then count how many games fall under each genre
genres_series = pd.Series(
    itchio_df.Genre.str.replace("'", "")
    .str[1:-1]
    .str.split(", ", expand=True)
    .stack()
    .values
).value_counts()
genres_series.plot(kind="bar", figsize=(20, 10), xlabel="Genres", ylabel="Number of Games", title="Number of itch.io games that are in each of the genres")
plt.show()
```

Many of the genres that were popular among Steam indie games were also popular among itch.io games, such as the "Adventure", "Puzzle", "Action" and "Platformer" genres. This suggests that these genres are popular to develop across indie games in general, and not just among Steam indie games or itch.io indie games.

We can also find the number of itch.io games that have each tag.
```python
# Same approach as for genres: split each Tags string into individual tags
# and count how many games carry each tag
tags_series = pd.Series(
    itchio_df.Tags.str.replace("'", "")
    .str[1:-1]
    .str.split(", ", expand=True)
    .stack()
    .values
).value_counts()
# Plot only the 20 most common tags
tags_series[0:20].plot(kind="bar", figsize=(20, 10), xlabel="Tags", ylabel="Number of Games", title="Number of itch.io games that have each tag")
plt.show()
```

Similar to the genres, many of the tags that were popular among Steam indie games were also popular among itch.io games, such as the "2D", "Pixel Art", "Singleplayer", "Short", "3D" and "Cute" tags. This suggests that these tags are popular across indie games in general, and not just among Steam indie games or itch.io indie games.

Therefore, we can conclude that the patterns we observed in the types of indie games produced on Steam are also present on itch.io. The decisions and solutions that indie game developers use to overcome limitations in resources and manpower on the commercial scale therefore also apply to indie game development in general, regardless of whether the goal is to create a legitimate commercial game or just a hobby or passion project.
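The cross-platform comparison above was made by eye from the bar charts. A minimal sketch of how the overlap could be quantified is shown below: it takes the top N tags of each platform and reports the shared tags together with the Jaccard similarity of the two sets. The tag counts here are made-up toy values, and `top_n` and `overlap` are hypothetical helper names, not functions used elsewhere in this report.

```python
from collections import Counter

def top_n(counts, n):
    """Return the n most common names from a {name: count} mapping."""
    return [name for name, _ in Counter(counts).most_common(n)]

def overlap(a, b):
    """Return the sorted shared items and the Jaccard similarity of two lists."""
    set_a, set_b = set(a), set(b)
    shared = set_a & set_b
    return sorted(shared), len(shared) / len(set_a | set_b)

# Hypothetical toy counts, not the real Steam or itch.io figures
steam_tag_counts = {"2D": 500, "Singleplayer": 450, "Pixel Art": 300, "Multiplayer": 120}
itchio_tag_counts = {"2D": 900, "Pixel Art": 700, "Short": 650, "Singleplayer": 400}

shared_tags, jaccard = overlap(top_n(steam_tag_counts, 3), top_n(itchio_tag_counts, 3))
print(shared_tags, jaccard)
```

A Jaccard similarity close to 1 would mean the two platforms share almost all of their top tags, while a value close to 0 would mean they share almost none.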
We can analyse how useful external tools and software are in indie game development by plotting the proportion of itch.io games by the number of tools and software used. These tools help with many different areas of development, such as programming, graphics and sound.
```python
plt.figure(figsize=(7, 7))
# Count games by number of tools used, ordered by tool count so the counts
# line up with the labels below (assumes tool counts 0 to 5 all occur)
tool_counts = itchio_df.tools_count.value_counts().sort_index()
# Fold every count from the sixth position onwards into a single "5+" slice
tool_counts.iloc[5] = tool_counts.iloc[5:].sum()
plt.pie(tool_counts.iloc[0:6], labels=[0, 1, 2, 3, 4, "5+"], autopct='%1.1f%%')
plt.title("Proportion of itch.io games by number of tools and software used")
plt.show()
```

As the number of tools and software used increases, the proportion of itch.io games decreases. However, the proportion of itch.io games that used 1 tool was almost equal to the proportion that used no tools, being only 0.1 percentage points lower. Overall, more itch.io games used at least 1 tool than used no tools at all. This shows that external tools and software are used quite frequently in indie game development.

We can also analyse which tools and software are most popular in indie game development by finding the top 20 tools and software among itch.io games.
```python
# Split each "Made with" string into individual tools and count how many
# games were made with each tool
tools_series = pd.Series(
    itchio_df.loc[:, "Made with"].str.replace("'", "")
    .str[1:-1]
    .str.split(", ", expand=True)
    .stack()
    .values
).value_counts()
# Plot only the 20 most common tools
tools_series[0:20].plot(kind="bar", figsize=(20, 10), xlabel="Tools and Software", ylabel="Number of Games", title="Top 20 tools and software among itch.io games")
plt.show()
```

Unity was by far the most popular tool, being used by far more itch.io games than Bitsy, the second most popular tool. This makes sense, as Unity is widely regarded as beginner-friendly and easy to use, which makes it well suited to inexperienced indie game developers.

Out of the top 20 tools and software, 12 were game engines or programming tools, 6 were graphics tools that can be used to make art assets, and 2 were for creating audio and sound effects. The game engines and programming tools are Unity, Bitsy, RenPy, GameMaker: Studio, Twine, Construct, PICO-8, Godot, RPG Maker, Unreal Engine, OpenFL and PuzzleScript. The graphics and art tools are Adobe Photoshop, Aseprite, Blender, GIMP, Clip Studio Paint and Paint.net. Lastly, the audio and sound effect tools are Audacity and FL Studio.

Therefore, we can conclude that external tools and software are very useful in many different areas of indie game development, such as programming, art and audio. This is especially true of software that is easy for indie game developers to use, such as Unity. As a result, external tools and software are an integral part of indie development, and many indie developers use them to combat the limitations of a lack of resources and manpower.
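The grouping of the top 20 tools into three areas can be written down as a lookup table, which also makes it possible to count how many games use at least one tool from each area. The sketch below uses the tool names listed above; the category labels, the `category_usage` helper and the toy list of games are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Grouping of the top 20 tools, taken from the lists above
TOOL_CATEGORIES = {
    "Game engine / programming": [
        "Unity", "Bitsy", "RenPy", "GameMaker: Studio", "Twine", "Construct",
        "PICO-8", "Godot", "RPG Maker", "Unreal Engine", "OpenFL", "PuzzleScript",
    ],
    "Graphics": [
        "Adobe Photoshop", "Aseprite", "Blender", "GIMP",
        "Clip Studio Paint", "Paint.net",
    ],
    "Audio": ["Audacity", "FL Studio"],
}
# Reverse lookup: tool name -> category
CATEGORY_OF = {tool: cat for cat, tools in TOOL_CATEGORIES.items() for tool in tools}

def category_usage(games_tools):
    """Count, per category, the games that use at least one tool from it.
    Tools outside the top 20 are ignored."""
    usage = Counter()
    for tools in games_tools:
        usage.update({CATEGORY_OF[t] for t in tools if t in CATEGORY_OF})
    return usage

# Hypothetical toy data: one list of tools per game
toy_games_tools = [
    ["Unity", "Blender"],
    ["Bitsy"],
    ["Unity", "Godot", "Audacity"],
    ["Custom Engine"],
]
usage = category_usage(toy_games_tools)
print(usage)
```

Using a set per game means a game made with both Unity and Godot is counted once for the engine category rather than twice.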
In summary, one factor that indie game developers have to consider is why they are developing an indie game. If it is for a game marketplace like Steam, it will most likely be a legitimate commercial game, and indie game development would be more of a career than a hobby. However, if it is for a more casual platform like itch.io, it can simply be a short passion project, made either to improve skills or as a hobby, rather than a long game with a lot of content. Regardless of the reason, the type of game they choose to make and the design choices they make are also very important factors. The decisions and solutions that indie game developers use to overcome limitations in resources and manpower apply whether the goal is a legitimate game that generates income or a hobby or passion project. Lastly, choosing which external tools and software to use is an important factor. These tools are very useful in many different areas of indie game development, such as programming, art and audio, especially when they are easy to use, like Unity. As a result, external tools and software are an integral part of indie development, and many indie developers use them to combat the limitations of a lack of resources and manpower.
In conclusion, the popularity of indie games among players has been on the rise and is catching up to the popularity of non-indie games, with more and more players starting to play indie games as time goes on. This is especially true for less popular games, where the rise in popularity is more rapid. As a result, the demand for indie games has risen exponentially, with many more indie games being released in the present than in the past. This increasing trend started at around 2008, when the number of indie games released started to grow exponentially. While non-indie games might still have more players than indie games, indie games are comparable in scale and popularity to non-indie games at every level of popularity.

This could be due to how indie game developers deal with limitations in resources and manpower. Despite the many restrictions they face when developing their games, indie game developers are able to work around these limitations in ingenious ways. As a result, even with the difference in resources and manpower, indie games can be of the same quality as non-indie games, perhaps even of a higher quality. Indie games can be just as enjoyable and positively received as non-indie games.

There are also many factors that contribute to the success of an indie game. The main factor is the quality of the game: to be one of the more popular games, an indie game has to be as polished and high-quality as possible. Other factors include the length of the game and how much content it has, which contributes the most to the success of very popular games, as well as the price of the game, which contributes the most to the success of less popular games.

In order to achieve this success, indie game developers have to consider many factors when developing an indie game. One factor is why they are developing the game. If it is for a game marketplace like Steam, it will most likely be a legitimate commercial game, and indie game development would be more of a career than a hobby. However, if it is for a more casual platform like itch.io, it can simply be a short passion project, made either to improve skills or as a hobby, rather than a long game with a lot of content. Regardless of the reason, the type of game they choose to make and the design choices they make are also very important factors. The decisions and solutions that indie game developers use to overcome limitations in resources and manpower apply whether the goal is a legitimate game that generates income or a hobby or passion project. Lastly, choosing which external tools and software to use is an important factor. These tools are very useful in many different areas of indie game development, such as programming, art and audio, especially when they are easy to use, like Unity. As a result, external tools and software are an integral part of indie development, and many indie developers use them to combat the limitations of a lack of resources and manpower.
Other than Steam games and itch.io games, I could also try to analyse the trends of games submitted to game jams. Game jams are online competitions where contestants are given a short period of time, usually weeks, days or even hours, to create a fully fledged and functioning game. Since contestants have to develop a full game in such a short period of time, there might be even more ingenious ways in which indie developers overcome the even smaller amount of resources they have available.

I could also try finding out the relationships between different tags and genres. Perhaps some combinations of tags or genres are much more effective than others. By analysing the combinations of different tags and genres, we can carry out a more in-depth analysis of the types of games that indie game developers produce, and obtain more meaningful results.
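As a starting point for this further work, the sketch below counts how often each pair of tags appears together on the same game. It assumes each entry of the tags column is a Python-style list written as a string, such as `"['2D', 'Pixel Art']"` (an assumption based on how the tags were split earlier), and parses it with `ast.literal_eval`. The `pair_counts` helper and the toy data are hypothetical.

```python
import ast
from collections import Counter
from itertools import combinations

def pair_counts(tag_strings):
    """Count how many games carry each unordered pair of tags."""
    counts = Counter()
    for raw in tag_strings:
        # Parse the list-like string, drop duplicate tags, and sort so that
        # each pair is always stored in the same order
        tags = sorted(set(ast.literal_eval(raw)))
        counts.update(combinations(tags, 2))
    return counts

# Hypothetical toy data in the assumed format
toy_tags = [
    "['2D', 'Pixel Art', 'Short']",
    "['2D', 'Pixel Art']",
    "['3D', 'Short']",
]
counts = pair_counts(toy_tags)
print(counts.most_common(3))
```

Parsing with `ast.literal_eval` keeps tag names that contain apostrophes or commas intact, which plain string splitting would not. The resulting pair counts could then be compared against a success measure to see whether some combinations perform better than others.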
<b>Background Information</b>

1. https://whatnerd.com/aaa-games-are-getting-worse/
2. https://slate.com/technology/2020/12/cyberpunk-2077-bugs-glitches-why-oh-why.html
3. https://www.gamedeveloper.com/culture/fallout-76-devs-say-mismanagement-and-crunch-led-to-buggy-launch
4. https://www.pcmag.com/opinions/next-gen-aaa-games-creep-toward-70-and-microtransactions-arent-going-anywhere
5. https://business.yougov.com/content/41600-us-charting-rise-indie-video-games
6. https://whatnerd.com/why-indie-games-surpassing-aaa-titles/

<b>Datasets</b>

1. https://steamdb.info/stats/gameratings/?all
2. https://steamspy.com/api.php
3. https://store.steampowered.com/
4. https://steamcharts.com/
5. https://itch.io/games/top-rated